Prompt Evaluation
Build evals to measure prompt quality systematically — before and after every change.
Why Evals Matter
Without evals, you're flying blind. You can't know if your prompt changes helped or hurt, and you can't catch regressions before they reach production.
Evals are the testing framework for prompts. Build them before you start iterating, not after.
The Eval Dataset
A good eval dataset has:
- 20–100 examples covering your input distribution
- Clear expected outputs (or acceptance criteria)
- Edge cases and adversarial inputs
- Representative samples from real usage
Building this dataset is the hardest part. Invest in it — a bad eval gives you false confidence in your numbers, which is worse than having no eval at all.
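For illustration, such a dataset can be represented as a small dataclass per example. The `EvalExample` name and its fields here are assumptions for the sketch, not from any particular framework:

```python
from dataclasses import dataclass, field

@dataclass
class EvalExample:
    inputs: dict                # values substituted into the prompt template
    expected: str               # expected output or acceptance criterion
    tags: list = field(default_factory=list)  # e.g. ["edge_case", "adversarial"]

eval_dataset = [
    EvalExample(inputs={"text": "Refund my order #123"}, expected="refund_request"),
    EvalExample(inputs={"text": "asdf;;;"}, expected="unclear",
                tags=["adversarial"]),
]
```

Tagging edge cases and adversarial inputs lets you report scores per slice, not just one aggregate number.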
Types of Evals
Exact match — For classification and extraction tasks. The output must match the expected label. Fast, cheap, unambiguous.
Semantic similarity — For generation tasks where wording can vary. Use embedding distance or another model to check if meaning matches.
LLM-as-judge — Use a strong model to evaluate outputs against your criteria. More expensive but handles nuanced quality dimensions.
Human review — Gold standard for tasks where quality is subjective. Use for calibration and spot-checking automated evals.
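The first two eval types above are straightforward to sketch. Exact match is a normalized string comparison; semantic similarity is typically cosine similarity over embedding vectors from whatever embedding model you use (the vectors here are stand-ins):

```python
import math

def exact_match(output: str, expected: str) -> float:
    # Normalize whitespace and case so trivial formatting differences don't fail the eval.
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0

def cosine_similarity(a: list, b: list) -> float:
    # Semantic similarity between two embedding vectors, in [-1, 1].
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm
```

For semantic similarity you would embed both the output and the expected text, then accept the pair when cosine similarity clears a threshold you calibrate against human judgments.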
An Eval Framework in Practice
def run_eval(prompt_template, examples):
    results = []
    for example in examples:
        # Fill the template with this example's inputs and call the model
        output = call_model(prompt_template.format(**example.inputs))
        # Score the output against the expected answer
        # (exact match, similarity, or an LLM judge)
        score = judge(output, example.expected)
        results.append(score)
    # Mean score across the dataset
    return sum(results) / len(results)

# Run before and after every change
baseline_score = run_eval(old_prompt, eval_dataset)
new_score = run_eval(new_prompt, eval_dataset)
print(f"Delta: {new_score - baseline_score:+.2%}")
Setting Success Thresholds
Define acceptable quality before you start: "We ship when 90% of eval examples score 4/5 or higher." This prevents perpetual iteration and anchors decisions in data.
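A ship gate like that is a few lines of code. This sketch assumes per-example scores normalized to [0, 1], with 0.8 standing in for "4/5 or higher"; the function name and defaults are illustrative:

```python
def meets_threshold(scores, min_score=0.8, min_pass_rate=0.90):
    # Ship only when at least min_pass_rate of examples score min_score
    # or higher -- the code form of "90% of examples score 4/5 or higher".
    pass_rate = sum(1 for s in scores if s >= min_score) / len(scores)
    return pass_rate >= min_pass_rate
```

Wiring this into CI alongside the before/after delta turns the threshold from a stated intention into an enforced gate.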