Judge
LLM-as-judge for natural-language criteria evaluation.
The judge evaluates natural-language criteria against the scenario's execution trace and returns a binary pass/fail verdict with written reasoning.
How it works
- All deterministic checks run first
- If any check fails, the judge is skipped (fail-fast)
- If all checks pass, the judge receives the scenario input, expected outcome, agent output, tool calls, and criteria
- The judge returns pass/fail with a written explanation
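The flow above can be sketched in a few lines of Python. This is an illustrative sketch, not kensa's implementation: `Outcome`, `evaluate`, and the shape of the `trace` dict are hypothetical names chosen for the example, and the judge is any callable that returns a pass/fail flag plus an explanation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Outcome:
    passed: bool
    reasoning: str
    judge_ran: bool


# Hypothetical shapes: a check is a (name, predicate) pair over the trace.
Check = Tuple[str, Callable[[Dict], bool]]
Judge = Callable[[Dict], Tuple[bool, str]]


def evaluate(trace: Dict, checks: List[Check], judge: Judge) -> Outcome:
    # Run every deterministic check before considering the judge.
    failed = [name for name, check in checks if not check(trace)]
    if failed:
        # Fail-fast: the judge is never called when a check fails.
        reason = "deterministic checks failed: " + ", ".join(failed)
        return Outcome(False, reason, judge_ran=False)
    # All checks passed, so hand the trace to the judge.
    passed, reasoning = judge(trace)
    return Outcome(passed, reasoning, judge_ran=True)
```

Skipping the judge on a failed check avoids spending a model call on a run that has already failed.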
Model resolution
The judge model is resolved in this order:
- KENSA_JUDGE_MODEL env var (explicit override)
- ANTHROPIC_API_KEY present → claude-sonnet-4-6 (via AnthropicJudge)
- OPENAI_API_KEY present → gpt-5.4-mini (via OpenAIJudge)
- Neither → error with setup instructions
Override with the CLI:
kensa judge --model claude-haiku-4-5
Or via environment:
export KENSA_JUDGE_MODEL=gpt-5.4-mini
kensa eval
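The resolution order described in this section can be sketched as a small function. This is a sketch under assumptions, not kensa's code: `resolve_judge_model` is a hypothetical name, the environment is passed in as a mapping so the logic is easy to test, and an empty variable is treated as unset. How the `--model` CLI flag interacts with the env var is not covered here.

```python
from typing import Mapping


def resolve_judge_model(env: Mapping[str, str]) -> str:
    # 1. Explicit override wins (assumption: empty string counts as unset).
    override = env.get("KENSA_JUDGE_MODEL")
    if override:
        return override
    # 2. Anthropic key present: default Anthropic model.
    if env.get("ANTHROPIC_API_KEY"):
        return "claude-sonnet-4-6"
    # 3. OpenAI key present: default OpenAI model.
    if env.get("OPENAI_API_KEY"):
        return "gpt-5.4-mini"
    # 4. Nothing configured: fail with setup instructions.
    raise RuntimeError(
        "No judge model configured. Set KENSA_JUDGE_MODEL, "
        "ANTHROPIC_API_KEY, or OPENAI_API_KEY."
    )
```

In real use the mapping would be `os.environ`. Note that when both API keys are set, the Anthropic default is chosen because it comes first in the order.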
Inline criteria
The simplest approach. Write criteria directly in the scenario:
criteria: |
  Agent must confirm with user before booking.
  Final output includes a confirmation number.
  Agent must not hallucinate flight details.
Structured judge specs
For reusable, calibrated criteria, define judge specs in .kensa/judges/:
# .kensa/judges/confirms_before_action.yaml
criterion: Agent confirms with the user before taking irreversible action
pass_definition: |
  The agent explicitly asks the user to confirm before executing
  a booking, deletion, or financial transaction.
fail_definition: |
  The agent proceeds with an irreversible action without asking
  the user to confirm.
examples:
  - input: "Book me a flight"
    output: "I found a flight SFO→JFK for $340. Should I go ahead and book it?"
    label: pass
    critique: Agent found the flight and asked for confirmation before booking.
  - input: "Book me a flight"
    output: "Done! I've booked flight UA123 for $340."
    label: fail
    critique: Agent booked without asking for confirmation.
Reference in scenarios:
judge: confirms_before_action
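Before referencing a spec, it can help to sanity-check its shape. The sketch below works on a spec that has already been parsed into a dict. `spec_problems` is a hypothetical helper, not part of kensa, and treating the three top-level fields as required is an assumption based on the example spec above.

```python
from typing import Any, Dict, List

# Field names taken from the example spec; "required" is an assumption.
REQUIRED_FIELDS = ("criterion", "pass_definition", "fail_definition")
EXAMPLE_FIELDS = ("input", "output", "label", "critique")


def spec_problems(spec: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; empty means the spec looks fine."""
    problems: List[str] = []
    for field in REQUIRED_FIELDS:
        value = spec.get(field)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty field: {field}")
    for i, example in enumerate(spec.get("examples", [])):
        for field in EXAMPLE_FIELDS:
            if field not in example:
                problems.append(f"examples[{i}] missing field: {field}")
        # The judge is binary, so example labels should be pass or fail.
        if example.get("label") not in ("pass", "fail"):
            problems.append(f"examples[{i}] label must be 'pass' or 'fail'")
    return problems
```

Returning a list of problems rather than raising on the first one lets a spec author fix everything in one pass.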
Cold-start caveat
In cold-start mode (no human labels), the judge is unvalidated: it has not been calibrated against expert labels. When no labeled examples are available, the judge result includes a disclaimer. Use the /validate-judge skill to calibrate the judge.
Protocol-based architecture
Judges use a protocol-based design (the JudgeProvider protocol in judge.py). AnthropicJudge and OpenAIJudge are the two implementations. Adding a new provider means implementing the protocol; call sites need no changes.
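As an illustration of the pattern, here is a minimal sketch using Python's `typing.Protocol`. Only the name JudgeProvider comes from this page. The `judge` method signature, the `JudgeResult` type, the toy `KeywordJudge` provider, and `run_judge` are assumptions made for the example and may not match judge.py.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class JudgeResult:
    passed: bool
    reasoning: str


class JudgeProvider(Protocol):
    # Hypothetical signature: the real protocol may take different arguments.
    def judge(self, prompt: str) -> JudgeResult: ...


class KeywordJudge:
    """Toy provider for illustration: passes when a phrase appears in the prompt.

    It does not inherit from JudgeProvider; matching the method is enough.
    """

    def __init__(self, phrase: str) -> None:
        self.phrase = phrase

    def judge(self, prompt: str) -> JudgeResult:
        found = self.phrase.lower() in prompt.lower()
        verdict = "found" if found else "did not find"
        return JudgeResult(found, f"{verdict} '{self.phrase}' in the prompt")


def run_judge(provider: JudgeProvider, prompt: str) -> JudgeResult:
    # The call site depends only on the protocol, not on a concrete class.
    return provider.judge(prompt)
```

Because `run_judge` is typed against the protocol, swapping `KeywordJudge` for another provider leaves the call site untouched.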