Evaluation · 9 min · Dec 2025

How to Evaluate AI Systems That Don't Have Right Answers

Traditional testing has assertions. AI doesn't. Here's how to build evaluation frameworks for non-deterministic systems.

Software engineering has spent fifty years perfecting one idea: if you give a system input X, it should produce output Y. Every time. Deterministically.

AI breaks this contract.

Ask an LLM the same question twice and you'll get two different answers. Both might be correct. Both might be wrong. Both will sound confident. This fundamental non-determinism makes traditional testing — unit tests, integration tests, assertion-based QA — insufficient for AI systems.

So how do you know if your AI is working?

The Death of assertEqual

In traditional software, testing is straightforward:

test('calculateTax returns correct amount', () => {
  expect(calculateTax(100, 0.2)).toBe(20);
});

Input, output, assertion. If the test passes, the code works. But try writing this for an AI system:

test('summarise this medical record', async () => {
  const summary = await model.summarise(patientRecord);
  expect(summary).toBe(/* ...what exactly? */);
});

There is no single correct summary. There are many acceptable summaries, some better than others, and the quality of any given summary depends on completeness, accuracy, relevance, readability, and whether it preserves clinically significant details. No assertEqual can capture this.

This is the evaluation problem. And most teams solve it by not solving it — they read a few outputs, decide they "look good," and ship to production. In the industry, this is politely called manual review. It's more accurately called a vibe check.

Three Evaluation Strategies That Actually Work

There are better approaches. None of them are perfect. But combined, they provide the kind of systematic quality measurement that enterprise AI requires.

1. Golden Dataset Benchmarks

A golden dataset is a curated, expert-validated set of inputs and expected outputs that represents the full range of your system's expected behaviour. It includes normal cases, edge cases, adversarial inputs, and examples from every domain your system operates in. We cover the full engineering discipline of building these datasets in Building Ground Truth.

The key word is curated. A golden dataset isn't a random sample of production data. It's deliberately constructed to test specific capabilities and failure modes. And it's maintained over time — updated as new edge cases are discovered, as the domain evolves, as requirements change.

We typically build golden datasets of at least several thousand examples, weighted toward the edge cases that matter most. A medical AI system's golden dataset will over-represent rare conditions, ambiguous documentation, and cases where the correct answer is "I don't know."

Every model update, prompt change, or pipeline modification is benchmarked against the golden dataset before it touches production.

2. LLM-as-a-Judge

When your outputs are natural language — summaries, explanations, classifications with reasoning — you need evaluators that understand language. One effective approach is using a separate LLM as an automated judge.

The judge model receives the input, the system's output, and a rubric, then scores the output on specific dimensions: factual accuracy, completeness, relevance, tone.

But LLM judges have known failure modes. They exhibit position bias (preferring the first option presented), verbosity bias (rating longer answers higher), and sycophancy (agreeing with the system under test). Naive LLM-as-a-Judge implementations are unreliable.

We mitigate this by:

  • Requiring the judge to output structured reasoning before scoring (chain-of-thought evaluation)
  • Randomising the order of options presented to the judge
  • Cross-validating judge scores against deterministic heuristic evaluators
  • Using multiple judge models and checking for consensus

The result isn't perfect — no automated evaluation is. But it's dramatically better than manual spot-checking, and it scales to thousands of evaluations per deployment.

3. Continuous Semantic Regression Testing

This is where evaluation becomes genuinely powerful. Instead of just testing whether the current model version is "good enough," we test whether it's better or at least as good as the previous version across every capability dimension.

test('medical summarisation maintains quality', async () => {
  const goldenSet = await loadGoldenDataset('medical-summaries');
  
  const currentScores = await evaluate(CURRENT_MODEL, goldenSet);
  const candidateScores = await evaluate(CANDIDATE_MODEL, goldenSet);
  
  // Overall quality must not degrade
  expect(candidateScores.overall).toBeGreaterThanOrEqual(
    currentScores.overall * 0.98 // 2% tolerance
  );
  
  // Safety-critical dimensions have zero tolerance
  expect(candidateScores.factualAccuracy).toBeGreaterThanOrEqual(
    currentScores.factualAccuracy
  );
  
  // Flag any category-level regressions
  for (const category of goldenSet.categories) {
    const delta = candidateScores[category] - currentScores[category];
    if (delta < -0.02) {
      throw new RegressionError(category, delta);
    }
  }
});

This catches the most dangerous failure mode in AI: the model that improves overall but silently regresses in a specific area — what we call semantic drift. A prompt change that makes the model better at generating summaries might simultaneously make it worse at detecting PII. Without regression testing, you'd never know until a data breach.

"A model that's better on average but worse in safety-critical categories is not better. It's a liability."

The Evaluation Maturity Model

Most teams we work with are at Level 0 or 1. Reaching Level 3 is what separates AI demos from AI products.

Level 0 — No evaluation
"We tried it a few times and it seemed fine." Outputs are reviewed manually, sporadically, by the developers who built the system. This is where most AI projects are.

Level 1 — Ad hoc benchmarks
The team has assembled a test set — usually a few hundred examples — and runs it before major releases. Results are reviewed in spreadsheets. Better than nothing, but not systematic.

Level 2 — Automated evaluation in CI/CD
Evaluation runs automatically on every pull request. Golden datasets exist. LLM-as-a-Judge provides automated scoring. Results are tracked over time. Regressions block merges.

Level 3 — Continuous evaluation in production
Evaluation doesn't stop at deployment. Production outputs are sampled and scored continuously. Semantic drift is detected in real time. Quality thresholds trigger alerts. The evaluation framework is as sophisticated as the AI system itself.

Why Evaluation Is the Best Predictor of AI Success

Here's a pattern we've observed consistently across dozens of enterprise AI projects: the teams that invest heavily in evaluation outperform the teams that invest in model experimentation.

It's counterintuitive. Surely the model is what matters? But in practice, evaluation is what creates the feedback loop that makes everything else work. Without rigorous evaluation, you can't diagnose problems. Without diagnosis, you can't improve. Without improvement, you're just hoping.

The teams that build serious evaluation infrastructure know exactly where their systems are strong and weak. They can quantify the impact of every change. They can make promises to stakeholders — "our factual accuracy is 97.3% and has not degraded in six months" — and back them up with data.

That's not just engineering rigour. It's trust. And in enterprise AI, trust is the only thing that matters.
