Every software team has tests. Most AI teams don’t. This is a problem. If you’re deploying LLM-powered features without an evaluation framework, you have no way to know if a change improved things or made them worse.
Why Evals Matter
Traditional tests assert exact outputs. LLM outputs are non-deterministic, so exact-match assertions don't work; instead, evals score outputs along dimensions like accuracy, relevance, safety, and coherence.
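As a sketch of what a single dimension scorer can look like, here is a crude relevance check based on word overlap between prompt and output. The function name and heuristic are illustrative, not a standard API; in a real pipeline you'd more likely use an embedding similarity or an LLM-as-judge call.

```python
def score_relevance(prompt: str, output: str) -> float:
    """Crude relevance score: fraction of prompt content words echoed in the output."""
    # Tiny illustrative stopword list; a real implementation would use a proper one.
    stopwords = {"the", "a", "an", "is", "of", "to", "and", "in", "what", "how"}
    prompt_words = {w.lower().strip(".,?!") for w in prompt.split()} - stopwords
    output_words = {w.lower().strip(".,?!") for w in output.split()}
    if not prompt_words:
        return 0.0
    return len(prompt_words & output_words) / len(prompt_words)
```

The point is not the heuristic itself but the shape: every dimension reduces to a function returning a score in [0, 1], which keeps scorers swappable.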
Evals are to AI engineering what tests are to software engineering. You wouldn’t deploy code without tests. You shouldn’t deploy prompts without evals. The discipline is the same — only the assertions are fuzzier.
Building an Eval Pipeline
from dataclasses import dataclass


@dataclass
class EvalResult:
    accuracy: float
    relevance: float
    safety: float
    overall: float


def evaluate_output(
    prompt: str,
    output: str,
    reference: str,
) -> EvalResult:
    # score_accuracy, score_relevance, and score_safety are your scoring
    # functions (string metrics, LLM-as-judge calls, etc.), defined elsewhere.
    accuracy = score_accuracy(output, reference)
    relevance = score_relevance(prompt, output)
    safety = score_safety(output)
    overall = (accuracy + relevance + safety) / 3
    return EvalResult(
        accuracy=accuracy,
        relevance=relevance,
        safety=safety,
        overall=overall,
    )


def run_eval_suite(
    prompt_template: str,
    test_cases: list[dict],
    threshold: float = 0.8,
) -> bool:
    if not test_cases:
        raise ValueError("eval suite needs at least one test case")
    results = []
    for case in test_cases:
        # generate() calls your model with the filled-in prompt template.
        output = generate(prompt_template.format(**case["input"]))
        result = evaluate_output(
            case["input"]["query"], output, case["expected"]
        )
        results.append(result)
    avg_score = sum(r.overall for r in results) / len(results)
    return avg_score >= threshold
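run_eval_suite expects each test case to carry an input mapping (used to fill the prompt template, with the query key also feeding the relevance scorer) and a verified reference answer. A minimal suite might look like this; the queries and answers are purely illustrative:

```python
test_cases = [
    {
        "input": {"query": "What is the capital of France?"},
        "expected": "The capital of France is Paris.",
    },
    {
        "input": {"query": "Summarize the water cycle in one sentence."},
        "expected": "Water evaporates, condenses into clouds, and falls as precipitation.",
    },
]
```

Because the template is filled with str.format(**case["input"]), any placeholder in your prompt template must have a matching key in every case's input dict.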
Getting Started
Don’t over-engineer it. Start with 50 golden examples — input-output pairs you’ve manually verified. Run your eval suite on every prompt change. Track scores over time. You’ll be surprised how often a “small” prompt tweak causes regressions you wouldn’t have caught manually.
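One lightweight way to store those golden examples is a JSONL file, one verified input-output pair per line. The filename and schema below are illustrative, not a prescribed format:

```python
import json
import tempfile
from pathlib import Path


def load_golden_examples(path: str) -> list[dict]:
    """Load manually verified input/expected pairs, one JSON object per line."""
    cases = []
    for line in Path(path).read_text().splitlines():
        if line.strip():
            cases.append(json.loads(line))
    return cases


# Example: write two golden examples to a temporary file, then load them back.
golden_path = Path(tempfile.gettempdir()) / "golden.jsonl"
golden_path.write_text(
    '{"input": {"query": "What is 2+2?"}, "expected": "4"}\n'
    '{"input": {"query": "Capital of Japan?"}, "expected": "Tokyo"}\n'
)
cases = load_golden_examples(str(golden_path))
```

Checking this file into version control alongside your prompts means every prompt change and its eval data get reviewed together.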