Say “evaluate my agent” in Claude Code or any skill-aware coding agent, and one of five skills picks the next step: audit setup, generate scenarios, validate the judge, run, or diagnose failures. Each skill drives the CLI under the hood.
Installation
Installation adds kensa to your dev dependencies, scaffolds .kensa/, and prompts you to choose which coding agent to install skills for. It works with Claude Code, Codex, Cursor, OpenCode, Gemini CLI, and more. Run kensa skills install later to refresh the skills after a kensa upgrade.
Claude Code skills go in .claude/skills. Codex, Cursor, OpenCode, Gemini CLI, and others use the open Agent Skills directory at .agents/skills.
Lifecycle
/audit-evals
The default entry point. Assesses readiness, identifies testable behaviors, and prepares the environment.
What it does:
- Checks kensa installation
- Determines current state (scenarios exist? traces exist?)
- Scans codebase for entry point, SDK, tools, behaviors, env vars
- Verifies instrumentation with kensa doctor
- Routes to the appropriate next skill
/generate-scenarios
Generates test scenarios covering five categories:
- Happy path - expected behavior with valid inputs
- Tool usage - correct tool selection and ordering
- Edge cases - boundary conditions, unusual inputs
- Error handling - graceful failure, meaningful error messages
- Cost/latency bounds - resource usage stays within limits
Scenarios are saved to .kensa/scenarios/.
/generate-judges
Creates structured judge prompts for subjective evaluation criteria:
- Binary pass/fail definitions (no Likert scales)
- 2-4 few-shot examples with critiques
- Designed for reuse across scenarios
Judge prompts are saved to .kensa/judges/.
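To make the judge structure above concrete, here is a minimal sketch that assembles a binary pass/fail prompt with few-shot examples and critiques. The JudgeExample class, the build_judge_prompt function, and the prompt layout are all hypothetical illustrations; the files the skill actually writes may look different.

```python
from dataclasses import dataclass


@dataclass
class JudgeExample:
    """One few-shot example: an agent output, its verdict, and a critique."""
    output: str
    verdict: str   # "pass" or "fail" only; no Likert scale
    critique: str


def build_judge_prompt(criterion: str, pass_def: str, fail_def: str,
                       examples: list[JudgeExample]) -> str:
    """Assemble a reusable judge prompt (hypothetical layout, not kensa's format)."""
    if not 2 <= len(examples) <= 4:
        raise ValueError("expected 2-4 few-shot examples")
    for ex in examples:
        if ex.verdict not in ("pass", "fail"):
            raise ValueError(f"verdict must be 'pass' or 'fail', got {ex.verdict!r}")

    lines = [
        f"Criterion: {criterion}",
        f"PASS means: {pass_def}",
        f"FAIL means: {fail_def}",
        "",
        "Examples:",
    ]
    for i, ex in enumerate(examples, start=1):
        lines += [
            f"{i}. Output: {ex.output}",
            f"   Critique: {ex.critique}",
            f"   Verdict: {ex.verdict.upper()}",
        ]
    lines += ["", "Give a short critique, then answer with exactly PASS or FAIL."]
    return "\n".join(lines)


# Illustrative criterion and examples, invented for this sketch.
prompt = build_judge_prompt(
    criterion="Response is polite",
    pass_def="No insults or dismissive language.",
    fail_def="Contains insults or dismissive language.",
    examples=[
        JudgeExample("Happy to help with that.", "pass", "Courteous and direct."),
        JudgeExample("Figure it out yourself.", "fail", "Dismissive toward the user."),
    ],
)
print(prompt)
```

Keeping the criterion and its pass/fail definitions separate from any one scenario is what lets the same prompt be reused across scenarios.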
/validate-judge
Tests judge accuracy against human-labeled examples:
- Requires 8-20 labeled examples
- Measures TPR (true positive rate) and TNR (true negative rate)
- Target threshold: both ≥ 90%
- Iterates on the judge prompt until thresholds are met
- Optional bootstrap resampling for confidence intervals
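The metrics above can be sketched in a few lines of Python. This is an illustration of the arithmetic only, not kensa's implementation: the function names, the sample labels, and the choice of a percentile bootstrap are assumptions made for the example.

```python
import random


def tpr_tnr(human: list[bool], judge: list[bool]) -> tuple[float, float]:
    """Return (TPR, TNR), treating human labels as ground truth (True = pass)."""
    tp = sum(h and j for h, j in zip(human, judge))
    tn = sum((not h) and (not j) for h, j in zip(human, judge))
    pos = sum(human)
    neg = len(human) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative label")
    return tp / pos, tn / neg


def bootstrap_ci(human, judge, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence intervals for TPR and TNR."""
    rng = random.Random(seed)
    n = len(human)
    tprs, tnrs = [], []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        try:
            tpr, tnr = tpr_tnr([human[i] for i in idx], [judge[i] for i in idx])
        except ValueError:
            continue  # resample contained only one class; skip it
        tprs.append(tpr)
        tnrs.append(tnr)

    def interval(values):
        values = sorted(values)
        lo = values[int((alpha / 2) * (len(values) - 1))]
        hi = values[int((1 - alpha / 2) * (len(values) - 1))]
        return lo, hi

    return interval(tprs), interval(tnrs)


# Ten hypothetical labeled examples: True = pass, False = fail.
human = [True, True, True, True, True, True, False, False, False, False]
judge = [True, True, True, True, True, False, False, False, False, True]

tpr, tnr = tpr_tnr(human, judge)          # 5/6 and 3/4 for this data
meets_threshold = tpr >= 0.90 and tnr >= 0.90
tpr_ci, tnr_ci = bootstrap_ci(human, judge)
print(f"TPR={tpr:.2f} TNR={tnr:.2f} meets_threshold={meets_threshold}")
print(f"TPR 95% CI={tpr_ci} TNR 95% CI={tnr_ci}")
```

With only 8-20 labeled examples the intervals tend to be wide, which is one reason to report them alongside the point estimates.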
/diagnose-errors
Analyzes eval results after a run:
- Categorizes failures: check failures, judge rejections, errors, uncertain
- Reads .kensa/results/ and .kensa/traces/
- Identifies failure patterns across scenarios
- Recommends next action: fix agent, improve judge, add scenarios, etc.
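The categorization step can be pictured as a simple tally. The sketch below uses hypothetical in-memory records with invented scenario names and status strings; it does not reflect the actual format of the files under .kensa/results/.

```python
from collections import Counter, defaultdict

# Hypothetical result records; kensa's real result files may look different.
results = [
    {"scenario": "refund_happy_path", "status": "pass"},
    {"scenario": "refund_missing_order", "status": "check_failure"},
    {"scenario": "refund_tool_order", "status": "check_failure"},
    {"scenario": "tone_under_pressure", "status": "judge_rejection"},
    {"scenario": "timeout_upstream", "status": "error"},
    {"scenario": "ambiguous_request", "status": "uncertain"},
]

# The four failure categories named above, as illustrative status strings.
FAILURE_CATEGORIES = ("check_failure", "judge_rejection", "error", "uncertain")


def summarize_failures(records):
    """Count failures per category and list the scenarios in each."""
    counts = Counter()
    by_category = defaultdict(list)
    for rec in records:
        if rec["status"] in FAILURE_CATEGORIES:
            counts[rec["status"]] += 1
            by_category[rec["status"]].append(rec["scenario"])
    return counts, dict(by_category)


counts, by_category = summarize_failures(results)
for category, n in counts.most_common():
    print(f"{category}: {n} -> {', '.join(by_category[category])}")
```

Grouping scenarios by category makes patterns easier to spot: a cluster of check failures points toward the agent, while a cluster of judge rejections may point toward the judge prompt.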