The AI user's guide to evals

How to actually make your AI workflows, er, work.

ai

Published: December 23, 2025

Has this ever happened to you? You spent a weekend feeling like a genius because you chained together a few LLM prompts in Zapier, or maybe you built a "content generator" in your company’s internal tool like Glean. It worked perfectly for the first three tries. You showed your boss and they immediately promoted you. For the first time in your life, your father told you he was proud.

Then, on Tuesday, everything went awry: your system hallucinated a policy that doesn't exist, and a customer got a free Labubu. On Wednesday, it responded to a user in German (you are based in Ohio). By Thursday, you’re back to doing it manually because you can't trust the "magic" bot you built.

You are not entirely surprised to discover that your tool has some rough edges that need sorting out and your prompts are not entirely bulletproof. And thus, your journey into evals begins.

The TL;DR

  • Evals are just "software testing" adapted for the fuzzy, probabilistic world of AI. They measure (and help you fix) when your system screws up.
  • The most important part of evals is data analysis – you cannot measure and fix what you cannot identify. Look at your data!
  • Complex "LLM as a Judge" setups seem sexy but will probably ruin your life; avoid them until you have exhausted simple keyword checks (assertions).
  • If you don't measure your AI's performance systematically, you aren't building a tool; you're building a slot machine that occasionally pays out productivity.


A conceptual grounding for evals

To understand evals, we first have to take a brief detour into how normal software works.

In traditional engineering, code is deterministic, which is (another) fancy word that means it’s perfectly predictable. If you write a function that adds 2 + 2, the answer is 4. It is 4 today, it is 4 tomorrow, and it is 4 during a Black Friday sale. Testing this is easy: you write a "Unit Test" that asserts that the answer = 4. If the computer returns 5, the test fails, and you fix whatever embarrassing error was in the code.
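What that kind of deterministic test looks like in practice, as a minimal sketch (the function name here is made up for illustration):

```python
def add(a, b):
    """A deterministic function: same inputs, same output, every time."""
    return a + b

# A unit test is just an assertion about a known answer.
# If add() ever returns something else, the test fails loudly.
assert add(2, 2) == 4  # true today, tomorrow, and during a Black Friday sale
```

That's the whole trick: there's exactly one right answer, so checking it is trivial.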

AI models are different: they are stochastic, probabilistic engines that guess the next word in a sentence. If you ask an LLM to "summarize notes from this call," the output might be different every single time you run it.

So, how do you "test" a summary? There is no single "correct" string of text to compare it to, and there are conceivably multiple different kinds of responses that would make you happy. Perhaps you would prefer the LLM summarize the call as “a complete waste of time.”

This is where evals come in. An Eval is a system that takes the fuzzy output of AI and grades it against a standard of quality. It’s the difference between saying "I think the bot is getting better" (vibes) and "The bot’s hallucination rate dropped from 15% to 3% this week" (science). Well, it’s still not entirely science but it’s closer.
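The simplest kind of eval is exactly the keyword check (assertion) mentioned above: a few cheap rules run over your AI's outputs that catch the failure modes you've seen. Here's a sketch; the example outputs and the rules are made up for illustration:

```python
# Hypothetical logged outputs from an AI workflow.
outputs = [
    "Per our policy, refunds are available within 30 days.",
    "Grüß Gott! Ihre Bestellung ist unterwegs.",  # responded in German
]

def passes_checks(text):
    """Crude assertion-style eval: returns True if the output looks OK."""
    # Rule 1: response should be plain-ASCII English (a very rough language check).
    english = all(ord(c) < 128 for c in text)
    # Rule 2: must not promise things we never offer.
    no_free_stuff = "free labubu" not in text.lower()
    return english and no_free_stuff

# An aggregate score you can actually track week over week.
pass_rate = sum(passes_checks(o) for o in outputs) / len(outputs)
```

Run that over a batch of logged outputs and suddenly "the bot seems off lately" becomes "the pass rate dropped from 95% to 50%."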

There’s a lot written about how engineering teams building AI into their products should do evals…but quite a bit less about how you can use them in your day-to-day work with AI tools.

One of the experts on the subject is Hamel Husain – he helps teams build successful AI products and teaches the AI Evals For Engineers & PMs course on Maven. I talked to him about what non-engineers can do to bring evals into the work they’re doing with AI to make their systems more foolproof. Then I wrote down what we talked about. So here is that.

The flow goes something like this:

  1. Analyze your system and figure out where things go wrong (look at the data)
  2. Create evals (simple or complex) to fix those things
  3. Rinse and repeat (or at least repeat)

Step 1: look at the data, damnit!

I’m starting this guide off really strong by picking a first step that nobody really wants to do. Many (not you, intelligent reader) probably hoped there was some software that you could drop in to fix your problems for you. Unfortunately there is not. You are just going to have to look at the data.
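If "look at the data" sounds daunting, it can be as simple as pulling a random handful of logged outputs and reading them. A minimal sketch, assuming you've exported your traces somewhere (the data here is hypothetical):

```python
import random

# Hypothetical logged outputs -- in practice, export these from your AI tool.
traces = [{"id": i, "output": f"summary {i}"} for i in range(100)]

def sample_for_review(rows, n=20, seed=0):
    """Pick n random traces so a human can read them and label what went wrong."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return rng.sample(rows, min(n, len(rows)))

batch = sample_for_review(traces)
# Read each one and jot down a failure category:
# hallucination, wrong language, wrong tone, etc.
```

Twenty random examples, read carefully, will teach you more about your system's failure modes than any dashboard.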


Written with 💔 by Justin in Brooklyn