Skills

Say “evaluate my agent” in Claude Code or any skill-aware coding agent, and one of five skills picks up the next step: auditing the setup, generating scenarios, generating judges, validating a judge, or diagnosing failures. Each skill drives the CLI under the hood, including the eval run itself (kensa eval).

Installation

uvx kensa init

Adds kensa to your dev dependencies, scaffolds .kensa/, and prompts you to choose which coding agent to install skills for. Works with Claude Code, Codex, Cursor, OpenCode, Gemini CLI, and more. Run kensa skills install later to refresh the skills after a kensa upgrade.

kensa skills install -a claude
kensa skills install -a codex
kensa skills install -a cursor
kensa skills install -a opencode
kensa skills install -a gemini
kensa skills install -a all

Claude Code uses .claude/skills. Codex, Cursor, OpenCode, Gemini CLI, and others use the open Agent Skills directory at .agents/skills.
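The directory rule above can be sketched as a small Python lookup. The helper name skills_dir is hypothetical and not part of kensa; the agent names simply mirror the -a values shown earlier.

```python
from pathlib import Path

# Directories as stated above. The helper is illustrative, not kensa's code.
CLAUDE_SKILLS = Path(".claude/skills")
AGENT_SKILLS = Path(".agents/skills")


def skills_dir(agent: str, project_root: Path = Path(".")) -> Path:
    """Return the directory where skills for `agent` are expected to live."""
    base = CLAUDE_SKILLS if agent == "claude" else AGENT_SKILLS
    return project_root / base
```

For example, skills_dir("codex", Path("/repo")) resolves to /repo/.agents/skills, while skills_dir("claude") resolves to .claude/skills relative to the current directory.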

Lifecycle

Setup (audit-evals) → Design (generate-scenarios) → Calibrate (generate-judges)
  → Validate (validate-judge) → Execute (kensa eval) → Diagnose (diagnose-errors) → Iterate

/audit-evals

The default entry point. Assesses readiness, identifies testable behaviors, and prepares the environment. What it does:

Checks kensa installation
Determines current state (scenarios exist? traces exist?)
Scans codebase for entry point, SDK, tools, behaviors, env vars
Verifies instrumentation with kensa doctor
Routes to the appropriate next skill
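The state check and routing step might look roughly like the sketch below. The routing rules are an assumption made for illustration (no scenarios means generate them, scenarios without traces means run the eval, otherwise diagnose); the actual skill may weigh more signals. Only the .kensa/scenarios/ and .kensa/traces/ paths come from this page.

```python
from pathlib import Path


def _has_files(directory: Path) -> bool:
    """True if the directory exists and contains at least one entry."""
    return directory.is_dir() and any(directory.iterdir())


def next_step(project_root: Path) -> str:
    """Pick a next step from what already exists under .kensa/.

    The routing rules here are assumed, not taken from kensa itself.
    """
    kensa_dir = project_root / ".kensa"
    if not _has_files(kensa_dir / "scenarios"):
        return "/generate-scenarios"
    if not _has_files(kensa_dir / "traces"):
        return "kensa eval"
    return "/diagnose-errors"
```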

/generate-scenarios

Generates test scenarios covering five categories:

Happy path - expected behavior with valid inputs
Tool usage - correct tool selection and ordering
Edge cases - boundary conditions, unusual inputs
Error handling - graceful failure, meaningful error messages
Cost/latency bounds - resource usage stays within limits

Outputs YAML files to .kensa/scenarios/.
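To see which of the five categories a scenario set still lacks, a coverage check along these lines may help. It deliberately works on plain (name, category) pairs rather than the YAML files, since the scenario schema is not described here; the category identifiers are hypothetical labels for the five items above.

```python
# Hypothetical identifiers for the five categories listed above.
CATEGORIES = (
    "happy_path",
    "tool_usage",
    "edge_cases",
    "error_handling",
    "cost_latency",
)


def missing_categories(scenarios: list[tuple[str, str]]) -> list[str]:
    """Return categories with no scenario, in the order listed above.

    `scenarios` holds (name, category) pairs; this shape is an assumption.
    """
    covered = {category for _, category in scenarios}
    unknown = covered - set(CATEGORIES)
    if unknown:
        raise ValueError(f"unknown categories: {sorted(unknown)}")
    return [c for c in CATEGORIES if c not in covered]
```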

/generate-judges

Creates structured judge prompts for subjective evaluation criteria:

Binary pass/fail definitions (no Likert scales)
2-4 few-shot examples with critiques
Designed for reuse across scenarios

Outputs YAML specs to .kensa/judges/.
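The constraints above can be expressed as a small data model. This is a sketch only: the class and field names are hypothetical and do not reflect the actual YAML spec format.

```python
from dataclasses import dataclass, field


@dataclass
class JudgeExample:
    output: str    # agent output shown to the judge
    passed: bool   # binary verdict, no Likert scale
    critique: str  # short explanation of the verdict


@dataclass
class JudgeSpec:
    name: str
    pass_definition: str
    fail_definition: str
    examples: list[JudgeExample] = field(default_factory=list)

    def problems(self) -> list[str]:
        """List ways this spec departs from the guidelines above."""
        found = []
        if not 2 <= len(self.examples) <= 4:
            found.append("expected 2-4 few-shot examples")
        if any(not ex.critique.strip() for ex in self.examples):
            found.append("every example needs a critique")
        return found
```

Storing the verdict as a bool makes the binary pass/fail rule structural: there is no way to record a score on a scale.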

/validate-judge

Tests judge accuracy against human-labeled examples:

Requires 8-20 labeled examples
Measures TPR (true positive rate) and TNR (true negative rate)
Target threshold: both ≥ 90%
Iterates on the judge prompt until thresholds are met
Optional bootstrap resampling for confidence intervals
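The metrics above are standard and can be sketched in a few lines. This is not kensa's implementation; the function names are illustrative, and the bootstrap uses a simple percentile interval. With only 8-20 labeled examples, expect the intervals to be wide.

```python
import random


def tpr_tnr(human: list[bool], judge: list[bool]) -> tuple[float, float]:
    """True positive rate and true negative rate of judge vs. human labels."""
    if len(human) != len(judge):
        raise ValueError("label lists must have the same length")
    pos = sum(human)
    neg = len(human) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative label")
    tp = sum(h and j for h, j in zip(human, judge))
    tn = sum((not h) and (not j) for h, j in zip(human, judge))
    return tp / pos, tn / neg


def meets_threshold(human, judge, threshold=0.90) -> bool:
    """Both rates must reach the threshold (90% by default, as above)."""
    tpr, tnr = tpr_tnr(human, judge)
    return tpr >= threshold and tnr >= threshold


def bootstrap_ci(human, judge, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap intervals for (TPR, TNR)."""
    rng = random.Random(seed)
    pairs = list(zip(human, judge))
    tprs, tnrs = [], []
    for _ in range(n_resamples):
        sample = [rng.choice(pairs) for _ in pairs]
        h = [p[0] for p in sample]
        j = [p[1] for p in sample]
        try:
            tpr, tnr = tpr_tnr(h, j)
        except ValueError:
            continue  # resample lacked one class; skip it
        tprs.append(tpr)
        tnrs.append(tnr)
    if not tprs:
        raise ValueError("no usable resamples")

    def interval(values):
        ordered = sorted(values)
        lo = ordered[int((alpha / 2) * (len(ordered) - 1))]
        hi = ordered[int((1 - alpha / 2) * (len(ordered) - 1))]
        return lo, hi

    return interval(tprs), interval(tnrs)
```

Requiring both rates to pass matters because a judge that approves everything scores a perfect TPR while its TNR is zero.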

/diagnose-errors

Analyzes eval results after a run:

Categorizes failures: check failures, judge rejections, errors, uncertain
Reads .kensa/results/ and .kensa/traces/
Identifies failure patterns across scenarios
Recommends next action: fix agent, improve judge, add scenarios, etc.
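The categorization step can be pictured as a tally over failure records. The record shape below (dicts with "scenario" and "kind" keys) and the kind identifiers are assumptions for illustration; the actual layout of files under .kensa/results/ is not described here.

```python
from collections import Counter, defaultdict

# Hypothetical identifiers for the four failure categories listed above.
FAILURE_KINDS = ("check_failure", "judge_rejection", "error", "uncertain")


def summarize(failures: list[dict]) -> tuple[Counter, dict[str, list[str]]]:
    """Tally failures per kind and group scenario names under each kind."""
    counts = Counter()
    by_kind = defaultdict(list)
    for item in failures:
        kind = item["kind"]
        if kind not in FAILURE_KINDS:
            raise ValueError(f"unknown failure kind: {kind!r}")
        counts[kind] += 1
        by_kind[kind].append(item["scenario"])
    return counts, dict(by_kind)
```

Grouping scenario names by kind is one simple way to spot a pattern, for example many scenarios failing with the same kind after a single change.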
