Concepts
The mental model behind kensa. Scenarios, spans, checks, judges, and results.
Directory layout
Everything lives under `.kensa/` in your project root:

```
.kensa/
├── scenarios/   YAML scenario definitions
├── agents/      Agent entry points (optional)
├── judges/      Structured judge specs (optional)
├── traces/      JSONL span files (generated)
├── runs/        Run manifests (generated)
├── results/     Evaluation results (generated)
└── reports/     HTML/markdown/JSON reports (generated)
```
You write scenarios and (optionally) judge specs. Kensa generates everything else.
Scenarios
A scenario defines a single test case: an input, a command to run, and how to evaluate the result.
```yaml
id: classify_ticket
input: "Our entire team can't log in. SSO has returned 502 since 7am."
run_command: python agent.py {{input}}
checks:
  - type: output_matches
    params: { pattern: "^P[123]$" }
criteria: |
  P1 is for outages affecting multiple users.
```
Scenarios can be generated by your coding agent (`source: code` or `source: traces`) or written by hand (`source: user`). See Scenarios for the full field reference.
Spans and traces
When a scenario runs, kensa sets `KENSA_TRACE_DIR` and launches the agent in a subprocess. The `instrument()` call in your agent configures OpenTelemetry to write spans as JSONL.
A span represents one unit of work:
| Kind | What it captures |
|---|---|
| `llm` | Model call: model name, tokens, cost, latency |
| `tool` | Tool invocation: name, arguments, result |
| `agent` | Top-level agent span (parent of all others) |
| `chain` | Orchestration step (e.g. a LangChain chain) |
| `retriever` | RAG retrieval step |
Spans form a tree via parent/child relationships. Kensa reads the JSONL after execution and translates it to an internal format for checks and judging.
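To make the parent/child structure concrete, here is a minimal sketch of reading a JSONL trace and walking the span tree. The field names (`span_id`, `parent_id`, `kind`, `name`) are illustrative assumptions, not kensa's actual schema:

```python
import json
from collections import defaultdict

# Illustrative trace: one JSON object per line, as a JSONL file would hold.
# Field names here are assumptions, not kensa's real span schema.
trace_lines = [
    '{"span_id": "a1", "parent_id": null, "kind": "agent", "name": "root"}',
    '{"span_id": "b2", "parent_id": "a1", "kind": "llm", "name": "classify"}',
    '{"span_id": "c3", "parent_id": "a1", "kind": "tool", "name": "lookup"}',
]

spans = [json.loads(line) for line in trace_lines]

# Index children by parent_id to recover the tree.
children = defaultdict(list)
for span in spans:
    children[span["parent_id"]].append(span)

def print_tree(parent_id=None, depth=0):
    """Depth-first walk over the parent/child links."""
    for span in children[parent_id]:
        print("  " * depth + f'{span["kind"]}: {span["name"]}')
        print_tree(span["span_id"], depth + 1)

print_tree()
# agent: root
#   llm: classify
#   tool: lookup
```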
Checks
Checks are deterministic, free, and fast. They answer binary questions about the execution:
- Output checks: Does the output contain a string? Match a regex?
- Tool checks: Was a tool called? In the right order? Without duplicates?
- Resource checks: Under cost limit? Under turn limit? Under time limit?
See Checks for the full list of check types.
Checks run before the judge. If any check fails, the scenario fails immediately. No tokens spent on judging.
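The fail-fast behavior can be sketched as a loop that stops at the first failing check. The check-type names below mirror the scenario example above but are otherwise assumptions about kensa's internals:

```python
import re

def run_checks(output, checks):
    """Evaluate deterministic checks in order; return False at the
    first failure so the (paid) judge step is never reached.
    Check-type names are illustrative."""
    for check in checks:
        if check["type"] == "output_matches":
            if not re.search(check["params"]["pattern"], output):
                return False
        elif check["type"] == "output_contains":
            if check["params"]["text"] not in output:
                return False
    return True

# The P1/P2/P3 pattern from the scenario example above:
checks = [{"type": "output_matches", "params": {"pattern": "^P[123]$"}}]
print(run_checks("P1", checks))  # True: proceed to the judge
print(run_checks("P4", checks))  # False: fail immediately, skip the judge
```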
Judge
The judge is an LLM that evaluates subjective criteria against the execution trace. It receives the scenario input, expected outcome, agent output, tool calls, and your criteria. It returns binary pass/fail with written reasoning.
Two ways to define criteria:
- Inline: Write criteria directly in the scenario's `criteria` field.
- Structured: Define a reusable judge spec in `.kensa/judges/` with pass/fail definitions and few-shot examples. Reference it via the scenario's `judge` field.
`criteria` and `judge` are mutually exclusive. See Judge for model resolution and spec format.
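As a rough illustration of what the judge stage receives, here is a sketch that assembles the evidence into a prompt. The layout and wording are entirely hypothetical; kensa's real prompt and judge API are not shown here:

```python
def build_judge_prompt(scenario_input, output, tool_calls, criteria):
    """Assemble the evidence the judge evaluates.
    Prompt layout is a hypothetical illustration."""
    tools = "\n".join(f"- {t}" for t in tool_calls) or "(none)"
    return (
        f"Input:\n{scenario_input}\n\n"
        f"Agent output:\n{output}\n\n"
        f"Tool calls:\n{tools}\n\n"
        f"Criteria:\n{criteria}\n\n"
        "Answer PASS or FAIL with one sentence of reasoning."
    )

prompt = build_judge_prompt(
    "Our entire team can't log in. SSO has returned 502 since 7am.",
    "P1",
    ["lookup_sso_status"],
    "P1 is for outages affecting multiple users.",
)
print(prompt)
```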
Results
A result combines check outcomes and judge verdict into a single status:
| Status | Meaning |
|---|---|
| `pass` | All checks passed AND the judge passed (or no criteria were set) |
| `fail` | At least one check failed, or the judge failed |
| `error` | Agent crashed or timed out |
| `uncertain` | Judge couldn't reach a verdict |
Results are stored as JSON in `.kensa/results/` and can be rendered as terminal output, markdown, JSON, or a standalone HTML dashboard.
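The status resolution in the table above can be sketched as a small function. The precedence order (error, then check failure, then judge verdict) and the verdict values are assumptions drawn directly from that table:

```python
def resolve_status(checks_passed, judge_verdict=None, agent_errored=False):
    """Combine check outcomes and judge verdict per the status table.
    judge_verdict is 'pass', 'fail', 'uncertain', or None when no
    criteria were set. Precedence order is an assumption."""
    if agent_errored:
        return "error"
    if not checks_passed:
        return "fail"
    if judge_verdict is None or judge_verdict == "pass":
        return "pass"
    if judge_verdict == "uncertain":
        return "uncertain"
    return "fail"

print(resolve_status(True, "pass"))       # pass
print(resolve_status(True, None))         # pass (no criteria set)
print(resolve_status(False, "pass"))      # fail (a check failed)
print(resolve_status(True, "uncertain"))  # uncertain
```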
Runs and aggregation
Each `kensa run` invocation produces a run manifest: a record of which scenarios ran, their trace paths, exit codes, and durations.
Running the same scenarios multiple times enables:
- Variance detection: Are results consistent or flaky?
- Pass rate tracking: What percentage of runs pass each scenario?
- Cost/latency stats: Mean, stddev, min, max across runs.
- Anomaly flagging: Cost outliers, latency outliers, repeated tool calls, high turn counts.
Use `kensa analyze` to surface these stats.
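The aggregate statistics above are ordinary descriptive stats over per-run records. A minimal sketch, with made-up run data and an assumed outlier rule (anything over twice the median cost):

```python
from statistics import mean, median, stdev

# Hypothetical records for one scenario across five runs.
runs = [
    {"status": "pass", "cost_usd": 0.012},
    {"status": "pass", "cost_usd": 0.011},
    {"status": "fail", "cost_usd": 0.019},
    {"status": "pass", "cost_usd": 0.013},
    {"status": "pass", "cost_usd": 0.058},  # suspiciously expensive run
]

# Pass rate: fraction of runs with status "pass".
pass_rate = sum(r["status"] == "pass" for r in runs) / len(runs)

# Cost stats: mean, stddev, min, max across runs.
costs = [r["cost_usd"] for r in runs]
mu, sigma = mean(costs), stdev(costs)

# Flag cost outliers: > 2x the median (the threshold is an assumption).
outliers = [c for c in costs if c > 2 * median(costs)]

print(f"pass rate: {pass_rate:.0%}")  # pass rate: 80%
print(f"cost: mean={mu:.4f} stddev={sigma:.4f} "
      f"min={min(costs)} max={max(costs)}")
print(f"outliers: {outliers}")  # outliers: [0.058]
```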
Evaluation pipeline
The full pipeline in one diagram:
```
scenarios → run (subprocess + traces) → judge (checks + LLM) → results → report
```
`kensa eval` runs all three stages in sequence. Each stage can also be invoked independently for debugging or CI integration.