How Formal Verification Powers AI Agent Evaluation (Without the PhD)
Why vibes-based agent testing doesn't scale, and how mathematical guarantees change the game
The Evaluation Problem
You have 50 agents making 10,000 decisions per day. How do you know they're working correctly?
Most organizations can't answer that question with confidence, and the reason is structural.
The evaluation gap is real: you cannot test agents the way you test software. Agent behavior is non-deterministic. It depends on context, input phrasing, conversation history, and the interactions between multiple agents in a pipeline. Traditional testing assumes deterministic functions with known inputs and expected outputs. Agents don't work that way.
So what does?
The ConceptDB Approach
ConceptDB's evaluation engine is built on category theory, a branch of mathematics that provides guarantees about when two processes are truly equivalent. In practical terms, this means ConceptDB can verify that an agent pipeline produces the same quality results regardless of which path it takes through your systems.
Here's what that means without the math:
Evaluations are composable properties. Instead of writing end-to-end tests that try to cover every scenario, you define what "correct" means for individual agent behaviors. "Always verify customer identity before accessing billing." "Never disclose another customer's information." "Extraction must capture all required fields from the source document." These are properties, not test cases. They describe what must always be true, not what happens with one specific input.
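To make the idea concrete, here is a minimal sketch of what a property might look like in code. This is not ConceptDB's API; the `Property` class, field names, and execution-record shape are all illustrative assumptions.

```python
# Hypothetical sketch: a property is a named predicate over an execution
# record. It must hold for every execution, unlike a test case, which
# checks one expected output for one specific input.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Property:
    name: str
    check: Callable[[dict], bool]  # True if the execution satisfies it

# "Always verify customer identity before accessing billing."
verify_identity_first = Property(
    name="verify identity before billing access",
    check=lambda ex: not ex["accessed_billing"] or ex["identity_verified"],
)

# "Extraction must capture all required fields from the source document."
REQUIRED_FIELDS = {"policy_number", "date", "damage_description"}
all_fields_extracted = Property(
    name="all required fields extracted",
    check=lambda ex: REQUIRED_FIELDS <= set(ex["extracted_fields"]),
)

execution = {
    "accessed_billing": True,
    "identity_verified": True,
    "extracted_fields": ["policy_number", "date"],  # one field missing
}
print(verify_identity_first.check(execution))  # True
print(all_fields_extracted.check(execution))   # False
```

Note that neither property mentions a specific document or customer: each one constrains every execution the agent will ever perform.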
Properties compose into pipeline guarantees. When you define properties for each agent in a pipeline, ConceptDB composes them into end-to-end evaluations automatically. You don't write separate tests for every possible combination of agent interactions. The mathematical foundation guarantees that if each step satisfies its properties, the pipeline satisfies the composed properties. No gaps. No assumptions. No "it works because the tests pass."
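One way to picture the composition step, under the assumption (ours, not the source's) that each pipeline step carries its own property set: the pipeline-level evaluation is simply the conjunction of every step's properties over that step's execution record.

```python
# Hypothetical sketch of composition: a pipeline check is built from
# per-step property sets, so no combination of steps needs its own test.
def compose(step_properties: dict):
    """step_properties maps step name -> list of (name, predicate) pairs.
    Returns a pipeline-level check over {step name: execution record}."""
    def pipeline_check(trace: dict) -> list:
        violations = []
        for step, props in step_properties.items():
            for prop_name, predicate in props:
                if not predicate(trace[step]):
                    violations.append((step, prop_name))
        return violations  # an empty list means the composed property holds
    return pipeline_check

pipeline = compose({
    "intake": [("all fields extracted",
                lambda ex: ex["missing_fields"] == 0)],
    "assessment": [("payout matches formula",
                    lambda ex: abs(ex["payout"] - ex["formula_payout"]) <= 0.01)],
})

trace = {
    "intake": {"missing_fields": 1},
    "assessment": {"payout": 1200.00, "formula_payout": 1200.00},
}
print(pipeline(trace))  # [('intake', 'all fields extracted')]
```

The point of the sketch: adding or changing one step's properties never requires touching another step's, because the composed check is assembled from the parts.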
Verification runs continuously. These aren't one-time checks. ConceptDB evaluates every agent execution against its defined properties, in production, at scale. When a property is violated, you know immediately, not after a customer files a complaint.
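A rough sketch of that continuous loop, with an assumed alert hook standing in for whatever notification channel you use:

```python
# Hypothetical sketch: check every execution as it happens and surface
# violations immediately, instead of sampling traces after the fact.
def monitor(executions, properties, alert):
    """Check each execution against every property; call alert on failure.
    Returns the overall compliance rate."""
    compliant = 0
    for ex in executions:
        failed = [name for name, check in properties if not check(ex)]
        if failed:
            alert(ex["id"], failed)  # fires at violation time, not later
        else:
            compliant += 1
    return compliant / len(executions)

properties = [
    ("cites policy language", lambda ex: ex["citations"] > 0),
]
executions = [
    {"id": "claim-001", "citations": 2},
    {"id": "claim-002", "citations": 0},
]
alerts = []
rate = monitor(executions, properties, lambda cid, ps: alerts.append(cid))
print(rate)    # 0.5
print(alerts)  # ['claim-002']
```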
A Concrete Example: Insurance Claims Pipeline
Abstract principles need concrete proof. Here's how this works with a real pipeline: four agents processing insurance claims.
Individual Agent Evaluation
ConceptDB evaluates each agent against its defined properties:
Intake Agent
Property: All required fields extracted from source document
Result: 99.2% compliance across 45,000 claims
Violations: 360 claims with missing fields
Breakdown: Policy number (99.9%), date (99.7%), damage description (98.1%)
Verification Agent
Property: Every coverage determination cites specific policy language
Result: 99.8% compliance
Violations: 90 claims with unsupported determinations
Assessment Agent
Property: Payout calculation matches policy formula within $0.01
Result: 99.5% compliance
Violations: 225 claims with calculation discrepancies
Communication Agent
Property: Letter includes all legally required disclosures
Result: 99.9% compliance
Violations: 45 letters missing appeal rights language

Each agent looks strong in isolation. But the pipeline tells a different story.
Pipeline Evaluation
ConceptDB doesn't stop at individual agents. It evaluates how agents interact:
End-to-End Pipeline
Property: Claim processed correctly from document to correspondence
Result: 97.8% compliance across 45,000 claims
Gap: 2.2% failure rate (990 claims)

97.8% sounds high. But 990 incorrect claims per 45,000 is a regulatory problem. Where are the failures coming from?
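A quick sanity check on these numbers: if the four agents failed independently, their individual compliance rates would predict roughly 98.4% end-to-end compliance. The observed 97.8% is lower still, which is the first hint that interactions between agents, not individual agent quality, drive the gap.

```python
# Worked check: multiply the per-agent compliance rates reported above.
# Under an (unrealistic) independence assumption, the pipeline passes
# only when every agent passes.
per_agent = [0.992, 0.998, 0.995, 0.999]  # intake, verify, assess, comms
predicted = 1.0
for rate in per_agent:
    predicted *= rate
print(round(predicted, 3))  # 0.984 -> ~98.4% predicted vs 97.8% observed
```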
Interaction Effects
This is where composable evaluation reveals what isolated testing never would:
Interaction Analysis
Intake misses a field → Verification catches it:
Recovery rate: 94%
These errors are self-correcting.
Intake misses a field → Verification misses it too → Assessment propagates:
Propagation rate: 67%
These errors produce incorrect payouts.
Verification flags a coverage exclusion → Assessment ignores the flag:
Ignore rate: 8%
These errors produce payouts that should have been denied.
Assessment calculates correctly → Communication misquotes the amount:
Misquote rate: 0.3%
Low frequency, high severity: legal liability in every instance.

Now you know where to focus. The Intake-to-Assessment error propagation path accounts for the majority of end-to-end failures. Fixing that one interaction pattern (teaching the Assessment Agent to halt when upstream fields are missing) reduces the overall error rate from 2.2% to 0.7%.
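The fix itself is a simple precondition. Here is a hypothetical sketch of it (field names and record shapes are illustrative): the Assessment step halts and routes the claim back instead of propagating a calculation built on missing data.

```python
# Hypothetical sketch of the fix: Assessment halts when required upstream
# fields are missing, rather than propagating a bad payout downstream.
REQUIRED_FIELDS = {"policy_number", "date", "damage_description"}

def assess(claim: dict) -> dict:
    missing = REQUIRED_FIELDS - claim["fields"].keys()
    if missing:
        # Halt: send back for re-extraction instead of guessing a payout.
        return {"status": "halted", "missing": sorted(missing)}
    return {"status": "assessed"}

complete = {"fields": {"policy_number": "P-1", "date": "2024-06-01",
                       "damage_description": "hail damage to roof"}}
partial = {"fields": {"policy_number": "P-2", "date": "2024-06-02"}}

print(assess(complete)["status"])  # assessed
print(assess(partial)["status"])   # halted
```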
You found this without reading a single trace manually.
Composable Evaluation
The insurance example illustrates the core innovation, but the principle extends to any agent architecture.
Traditional evals test one agent at a time, in isolation. You evaluate the Intake Agent with test documents. You evaluate the Verification Agent with test claims. Each agent scores well. You deploy the pipeline. Failures appear that no individual eval predicted, because the failures live in the interactions between agents, not inside any single agent.
ConceptDB tests agents as they actually work: in pipelines, with real data flowing between them. The evaluation is not "does this agent produce good output for this input?" The evaluation is "does this pipeline satisfy these properties across all the data it has processed?"
Properties compose upward. You define properties at the behavior level โ individual rules that each agent must follow. ConceptDB composes these into agent-level evaluations, then into pipeline-level evaluations, then into system-level evaluations. At each level, the mathematical foundation guarantees that the composed evaluation is complete. If the individual properties cover the individual behaviors, the composed evaluation covers the pipeline. No manual integration testing. No hoping that your end-to-end tests caught every combination.
Adding agents doesn't break existing evaluations. When you add a fifth agent to the insurance pipeline โ say, a Fraud Detection Agent between Verification and Assessment โ existing properties still hold. You define properties for the new agent and its interactions with adjacent agents. ConceptDB recomposes the pipeline evaluation automatically. The properties you defined for Intake, Verification, Assessment, and Communication don't change. The overall pipeline evaluation now includes the new agent without requiring you to rewrite anything.
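A sketch of why recomposition is cheap, with illustrative names and data: if pipeline properties live in a per-step map, inserting a Fraud Detection step adds one entry, and every existing entry is untouched.

```python
# Hypothetical sketch: per-step property sets recompose automatically
# when a new step is added to the pipeline.
def evaluate(trace: dict, step_properties: dict) -> bool:
    """True when every step's execution record satisfies its properties."""
    return all(check(trace[step])
               for step, props in step_properties.items()
               for check in props)

step_properties = {
    "intake":        [lambda ex: ex["missing_fields"] == 0],
    "verification":  [lambda ex: ex["citations"] > 0],
    "assessment":    [lambda ex: ex["payout_delta"] <= 0.01],
    "communication": [lambda ex: ex["disclosures_complete"]],
}

# Adding a Fraud Detection step: one new entry, existing entries unchanged.
step_properties["fraud_detection"] = [lambda ex: ex["score"] is not None]

trace = {
    "intake": {"missing_fields": 0},
    "verification": {"citations": 3},
    "assessment": {"payout_delta": 0.0},
    "communication": {"disclosures_complete": True},
    "fraud_detection": {"score": 0.12},
}
print(evaluate(trace, step_properties))  # True
```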
Swapping models doesn't require re-evaluation from scratch. When you upgrade the Assessment Agent from one model to another, ConceptDB evaluates the new model against the same properties. Did accuracy hold? Did the interaction patterns change? You get a direct comparison โ same properties, different implementation โ with mathematical rigor behind the verdict.
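The model-swap comparison can be pictured the same way. This sketch assumes you can replay the same property over outputs from both implementations; the data here is invented for illustration.

```python
# Hypothetical sketch: score two implementations against the same property
# to compare a model swap directly, rather than re-evaluating from scratch.
def compliance(outputs, prop):
    """Fraction of outputs satisfying the property."""
    return sum(prop(o) for o in outputs) / len(outputs)

def within_one_cent(o):
    return abs(o["payout"] - o["formula_payout"]) <= 0.01

current_model = [{"payout": 100.00, "formula_payout": 100.00},
                 {"payout": 99.50, "formula_payout": 100.00}]
candidate_model = [{"payout": 100.00, "formula_payout": 100.00},
                   {"payout": 100.00, "formula_payout": 100.00}]

print(compliance(current_model, within_one_cent))    # 0.5
print(compliance(candidate_model, within_one_cent))  # 1.0
```

Same property, different implementation: the two numbers are directly comparable because the yardstick never changed.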
The ROI
Mathematical guarantees are interesting. Business impact is what matters.
Go Deeper
- See how ConceptDB provides visibility into agent behavior across your organization
- Learn how your business dictionary defines the standards your agents are measured against
Your AI. Your Data. Your Rules.
Your agents. Proven correct.