The AI user's guide to evals

How to actually make your AI workflows, er, work.

Published: December 11, 2025

Has this ever happened to you? You spent a weekend feeling like a genius because you chained together a few LLM prompts in Zapier, or maybe you built a "content generator" in your company’s internal tool like Glean. It worked perfectly for the first three tries. You showed your boss and they immediately promoted you. For the first time in your life, your father told you he was proud.

Then, on Tuesday, everything went awry: your system hallucinated a policy that doesn't exist, and a customer got a free Labubu. On Wednesday, it responded to a user in German (you are based in Ohio). By Thursday, you’re back to doing it manually because you can't trust the "magic" bot you built.

You are not entirely surprised to discover that your tool has some rough edges that need sorting out and your prompts are not entirely bulletproof. And thus, your journey into evals begins.

The TL;DR

  • Evals are just "software testing" adapted for the fuzzy, probabilistic world of AI. They measure (and help you fix) when your system screws up.
  • The most important part of evals is data analysis – you cannot measure and fix what you cannot identify. Look at your data!
  • Complex "LLM as a Judge" setups seem sexy but will probably ruin your life; avoid them until you have exhausted simple keyword checks (assertions).
  • If you don't measure your AI's performance systematically, you aren't building a tool; you're building a slot machine that occasionally pays out productivity.
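
To make the "simple keyword checks (assertions)" idea concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the function names, the phrase lists, and the sample outputs are hypothetical, and the German check is a crude word-overlap heuristic, not real language detection. It only shows the shape of the idea: run a handful of cheap pass/fail checks over saved outputs and track how often they all pass.

```python
import re

# Hypothetical phrase lists; in practice these come from failures you find in your own data.
FREE_OFFER_PHRASES = ["for free", "free of charge", "at no cost"]
GERMAN_MARKERS = {"der", "und", "nicht", "ist", "ihre"}


def mentions_free_offer(text: str) -> bool:
    """True if the output contains any phrase promising something for free."""
    lowered = text.lower()
    return any(phrase in lowered for phrase in FREE_OFFER_PHRASES)


def looks_german(text: str) -> bool:
    """Crude heuristic: two or more common German words appear in the output."""
    words = set(re.findall(r"[a-zäöüß]+", text.lower()))
    return len(words & GERMAN_MARKERS) >= 2


def run_assertions(output: str) -> dict[str, bool]:
    """Run each check on one output. True means the check passed."""
    return {
        "non_empty": bool(output.strip()),
        "no_free_offers": not mentions_free_offer(output),
        "not_german": not looks_german(output),
    }


def pass_rate(outputs: list[str]) -> float:
    """Fraction of outputs that pass every assertion."""
    if not outputs:
        return 0.0
    passed = [all(run_assertions(o).values()) for o in outputs]
    return sum(passed) / len(passed)


# Made-up sample outputs, standing in for logged responses from your workflow.
sample_outputs = [
    "Your order shipped today and should arrive Friday.",
    "Good news: we will send you a Labubu for free!",
    "Vielen Dank, Ihre Bestellung ist unterwegs und kommt bald.",
    "",
]

for text in sample_outputs:
    print(run_assertions(text))
print("pass rate:", pass_rate(sample_outputs))
```

Checks like these are blunt, but they are cheap, deterministic, and easy to debug, which is roughly why they are worth exhausting before reaching for an LLM-as-a-judge setup.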

A conceptual grounding for evals

To understand evals, we first have to take a brief detour into how normal software works.


Written with 💔 by Justin in Brooklyn