Prompt Evaluation
Build evals to measure prompt quality systematically — before and after every change.
Why Evals Matter
Without evals, you're flying blind. You can't know if your prompt changes helped or hurt, and you can't catch regressions before they reach production.
Evals are the testing framework for prompts. Build them before you start iterating, not after.
The Eval Dataset
A good eval dataset has:
- 20–100 examples covering your input distribution
- Clear expected outputs (or acceptance criteria)
- Edge cases and adversarial inputs
- Representative samples from real usage
Building this dataset is the hardest part. Invest in it — a bad eval gives you false confidence in your numbers, which is worse than having no eval at all.
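For illustration, such a dataset can be represented as a small dataclass per example. The `EvalExample` name and its fields here are assumptions for the sketch, not from any particular framework:

```python
from dataclasses import dataclass, field

@dataclass
class EvalExample:
    inputs: dict                # values substituted into the prompt template
    expected: str               # expected output or acceptance criterion
    tags: list = field(default_factory=list)  # e.g. ["edge_case", "adversarial"]

eval_dataset = [
    EvalExample(inputs={"text": "Refund my order #123"}, expected="refund_request"),
    EvalExample(inputs={"text": "asdf;;;"}, expected="unclear",
                tags=["adversarial"]),
]
```

Tagging edge cases and adversarial inputs lets you report scores per slice, not just one aggregate number.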
Types of Evals
Exact match — For classification and extraction tasks. The output must match the expected label. Fast, cheap, unambiguous.
Semantic similarity — For generation tasks where wording can vary. Use embedding distance or another model to check if meaning matches.
LLM-as-judge — Use a strong model to evaluate outputs against your criteria. More expensive but handles nuanced quality dimensions.
Human review — Gold standard for tasks where quality is subjective. Use for calibration and spot-checking automated evals.
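The first two eval types above are straightforward to sketch. Exact match is a normalized string comparison; semantic similarity is typically cosine similarity over embedding vectors from whatever embedding model you use (the vectors here are stand-ins):

```python
import math

def exact_match(output: str, expected: str) -> float:
    # Normalize whitespace and case so trivial formatting differences don't fail the eval.
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0

def cosine_similarity(a: list, b: list) -> float:
    # Semantic similarity between two embedding vectors, in [-1, 1].
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm
```

For semantic similarity you would embed both the output and the expected text, then accept the pair when cosine similarity clears a threshold you calibrate against human judgments.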
An Eval Framework in Practice
def run_eval(prompt_template, examples):
    results = []
    for example in examples:
        # Fill the template with this example's inputs and call the model
        output = call_model(prompt_template.format(**example.inputs))
        # Score the output against the expected answer
        # (exact match, similarity, or an LLM judge)
        score = judge(output, example.expected)
        results.append(score)
    # Mean score across the dataset
    return sum(results) / len(results)

# Run before and after every change
baseline_score = run_eval(old_prompt, eval_dataset)
new_score = run_eval(new_prompt, eval_dataset)
print(f"Delta: {new_score - baseline_score:+.2%}")
Setting Success Thresholds
Define acceptable quality before you start: "We ship when 90% of eval examples score 4/5 or higher." This prevents perpetual iteration and anchors decisions in data.
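A ship gate like that is a few lines of code. This sketch assumes per-example scores normalized to [0, 1], with 0.8 standing in for "4/5 or higher"; the function name and defaults are illustrative:

```python
def meets_threshold(scores, min_score=0.8, min_pass_rate=0.90):
    # Ship only when at least min_pass_rate of examples score min_score
    # or higher -- the code form of "90% of examples score 4/5 or higher".
    pass_rate = sum(1 for s in scores if s >= min_score) / len(scores)
    return pass_rate >= min_pass_rate
```

Wiring this into CI alongside the before/after delta turns the threshold from a stated intention into an enforced gate.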