Directory layout
Everything lives under .kensa/ in your project root.
Scenarios
A scenario defines a single test case: an input, a command to run, and how to evaluate the result. Scenarios can be generated (source: code or traces) or written by hand (source: user). See Scenarios for the full field reference.
Spans and traces
When a scenario runs, kensa launches the agent in a subprocess and auto-instruments it. LLM calls, tool use, and timing are captured as spans and written as JSONL. A span represents one unit of work:

| Kind | What it captures |
|---|---|
| llm | Model call - model name, tokens, cost, latency |
| tool | Tool invocation - name, arguments, result |
| agent | Top-level agent span (parent of all others) |
| chain | Orchestration step (e.g. LangChain chain) |
| retriever | RAG retrieval step |
| evaluator | Evaluation or scoring step emitted by the traced system |
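
To make the span model concrete, here is a minimal Python sketch that parses JSONL span records and groups them by kind. The field names (kind, name, duration_ms) and the sample records are assumptions for illustration only; they are not kensa's documented trace schema.

```python
import json
from collections import Counter

# Hypothetical span records. The field names ("kind", "name", "duration_ms")
# are assumptions for illustration, not kensa's documented schema.
SAMPLE_JSONL = """\
{"kind": "agent", "name": "run", "duration_ms": 1200}
{"kind": "llm", "name": "chat", "duration_ms": 800}
{"kind": "tool", "name": "search", "duration_ms": 150}
{"kind": "llm", "name": "chat", "duration_ms": 200}
"""


def parse_spans(text):
    """Parse JSONL text into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def summarize_by_kind(spans):
    """Return (span count per kind, total duration in ms per kind)."""
    counts = Counter(span["kind"] for span in spans)
    durations = Counter()
    for span in spans:
        durations[span["kind"]] += span.get("duration_ms", 0)
    return counts, durations


spans = parse_spans(SAMPLE_JSONL)
counts, durations = summarize_by_kind(spans)
print(dict(counts))
print(dict(durations))
```

Because each line is an independent JSON object, a trace file in this shape can be processed line by line without loading a surrounding array.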
Checks
Checks are deterministic, free, and fast. They answer binary questions about the execution:
- Output checks: Does the output contain a string? Match a regex?
- Tool checks: Was a tool called? In the right order? Without duplicates?
- Trajectory checks: Did the overall tool-call path match the expected sequence and stay within inline budgets?
- Resource checks: Under cost limit? Under turn limit? Under time limit?
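
The four check families above can be sketched as plain predicate functions. This is an illustrative sketch, not kensa's implementation: the function names, the sample output, and the tool names are all hypothetical.

```python
import re


def output_contains(output, needle):
    """Output check: does the output contain a literal string?"""
    return needle in output


def output_matches(output, pattern):
    """Output check: does a regex match anywhere in the output?"""
    return re.search(pattern, output) is not None


def tools_called_in_order(tool_calls, expected):
    """Tool check: do the expected names appear in order (as a subsequence)?"""
    remaining = iter(tool_calls)
    # `name in remaining` consumes the iterator, so order is enforced.
    return all(name in remaining for name in expected)


def no_duplicate_calls(tool_calls):
    """Tool check: was every tool called at most once?"""
    return len(tool_calls) == len(set(tool_calls))


def within_limit(value, limit):
    """Resource check: is a measured value (cost, turns, seconds) under its limit?"""
    return value <= limit


# Hypothetical execution data for illustration.
output = "Refund issued for order 1234"
tool_calls = ["lookup_order", "issue_refund"]

results = {
    "contains": output_contains(output, "Refund issued"),
    "regex": output_matches(output, r"order \d+"),
    "order": tools_called_in_order(tool_calls, ["lookup_order", "issue_refund"]),
    "no_duplicates": no_duplicate_calls(tool_calls),
    "cost": within_limit(0.02, 0.05),
}
print(results)
```

Each predicate returns a plain boolean and calls no model, which is what makes checks of this kind deterministic and cheap to run.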
Judge
The judge is an LLM that evaluates subjective criteria against the execution trace. It receives the scenario input, expected outcome, agent output, tool calls, and your criteria. It returns binary pass/fail with written reasoning. Two ways to define criteria:
- Inline: Write criteria directly in the scenario’s criteria field.
- Structured: Define a reusable judge spec in .kensa/judges/ with pass/fail definitions and few-shot examples. Reference it via the scenario’s judge field.
criteria and judge are mutually exclusive. See Judge for model resolution and spec format.
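
The mutual-exclusion rule can be expressed as a small validation step. The sketch below assumes a scenario has already been loaded into a dict with optional criteria and judge keys; the function name resolve_judging and the returned labels are hypothetical, and kensa's own validation may behave differently.

```python
def resolve_judging(scenario):
    """Decide how a scenario is judged, rejecting scenarios that set both fields.

    `scenario` is assumed to be a dict with optional "criteria" and "judge" keys.
    """
    has_criteria = bool(scenario.get("criteria"))
    has_judge = bool(scenario.get("judge"))

    if has_criteria and has_judge:
        raise ValueError("criteria and judge are mutually exclusive")
    if has_criteria:
        return ("inline", scenario["criteria"])
    if has_judge:
        return ("structured", scenario["judge"])
    # Neither field set: the scenario is evaluated by checks only.
    return ("none", None)


print(resolve_judging({"criteria": "The reply is polite and resolves the issue."}))
print(resolve_judging({"judge": "refund_quality"}))  # hypothetical judge spec name
print(resolve_judging({}))
```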
Results
A result combines check outcomes and judge verdict into a single status:

| Status | Meaning |
|---|---|
| pass | All checks passed AND judge passed (or no criteria set) |
| fail | At least one check failed, or the judge failed |
| error | Agent crashed or timed out |
| uncertain | Judge couldn’t reach a verdict |
Results are written to .kensa/results/ and can be rendered as terminal output, markdown, JSON, or a standalone HTML dashboard.
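
The status table can be read as a small decision function. The sketch below is one possible reading: the precedence (error first, then failed checks, then the judge verdict) is an assumption, since the table does not say how overlapping conditions are ordered, and combine_status is a hypothetical name.

```python
def combine_status(check_results, judge_verdict=None, crashed=False):
    """Fold check outcomes and a judge verdict into one status string.

    check_results: list of booleans, one per check.
    judge_verdict: "pass", "fail", "uncertain", or None when no criteria are set.
    crashed: True if the agent crashed or timed out.

    Precedence here is an assumption, not documented behavior.
    """
    if crashed:
        return "error"
    if not all(check_results) or judge_verdict == "fail":
        return "fail"
    if judge_verdict == "uncertain":
        return "uncertain"
    return "pass"


print(combine_status([True, True], "pass"))       # pass
print(combine_status([True, False], "pass"))      # fail
print(combine_status([True], None))               # pass
print(combine_status([True], "uncertain"))        # uncertain
print(combine_status([], None, crashed=True))     # error
```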
Runs and aggregation
Each kensa run invocation produces a run manifest: a record of which scenarios ran, their trace paths, exit codes, and durations.
Repeated runs give you more traces to inspect with kensa analyze, which currently surfaces
trace-level cost and latency distributions, tool usage, success rate, and anomaly flags.
When a scenario uses a trajectory check, result reports also surface
trajectory_accuracy and step_efficiency alongside pass/fail.
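
As a rough illustration of what aggregating repeated runs involves, the sketch below computes a success rate and simple cost and latency summaries from per-run records. The record fields and the aggregate function are hypothetical; this is not the output format or algorithm of kensa analyze, and it does not attempt to define trajectory_accuracy or step_efficiency.

```python
from statistics import median

# Hypothetical per-run records; field names are assumptions for illustration.
runs = [
    {"status": "pass", "cost_usd": 0.02, "latency_s": 3.0},
    {"status": "pass", "cost_usd": 0.03, "latency_s": 4.0},
    {"status": "fail", "cost_usd": 0.10, "latency_s": 9.0},
    {"status": "pass", "cost_usd": 0.02, "latency_s": 4.0},
]


def aggregate(runs):
    """Summarize success rate, latency, and cost across repeated runs."""
    latencies = [run["latency_s"] for run in runs]
    costs = [run["cost_usd"] for run in runs]
    passed = sum(run["status"] == "pass" for run in runs)
    return {
        "success_rate": passed / len(runs),
        "median_latency_s": median(latencies),
        "max_latency_s": max(latencies),
        "total_cost_usd": sum(costs),
    }


summary = aggregate(runs)
print(summary)
```

With only a handful of runs, the median and maximum are more informative than a mean, because a single slow or expensive run stands out instead of being averaged away.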
Evaluation pipeline
The full pipeline in one diagram:

kensa eval runs all three stages in sequence. Each stage can also be invoked independently for debugging or CI integration.