How it works
Your coding agent reasons: it reads your codebase, identifies failure modes from traces, and writes scenarios. The CLI computes: it instruments, executes, judges, and reports. Skills orchestrate the workflow between them.
Zero to eval
The coding agent bootstraps your evals to solve the cold-start problem. You review, not scaffold.
Checks gate the judge
Deterministic checks run before the LLM judge. If a check fails, no tokens are spent.
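The gating logic amounts to a short-circuit. A minimal sketch, with hypothetical function names (not kensa's actual internals):

```python
import json

def evaluate(run, checks, llm_judge):
    """Run deterministic checks first; call the LLM judge only if all pass."""
    for check in checks:
        ok, reason = check(run)
        if not ok:
            # Short-circuit: a failed check is the verdict; no judge tokens spent
            return {"verdict": "fail", "reason": reason, "tokens_spent": 0}
    return llm_judge(run)

# A deterministic check: the agent's final output must be valid JSON
def output_is_json(run):
    try:
        json.loads(run["output"])
        return True, ""
    except ValueError:
        return False, "output is not valid JSON"

verdict = evaluate({"output": "not json"}, [output_is_json], llm_judge=None)
print(verdict)  # the check fails, so llm_judge is never called
```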
Trace everything
Auto-instruments Anthropic, OpenAI, and LangChain via OpenTelemetry (OTel).
Dataset-driven evals
Point at a JSONL file: each row becomes a run with its own trace and verdict. Re-run for variance stats, flaky detection, and anomaly flagging.
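Re-running is what makes flakiness visible: a scenario that neither always passes nor always fails across repeats is flaky. A minimal sketch of that classification (function names are illustrative, not kensa's API):

```python
from collections import defaultdict

def flaky_scenarios(results, runs_per_scenario):
    """Flag scenarios whose repeated runs disagree.

    `results` is a list of (scenario_id, passed) pairs, one per run.
    A scenario is flaky if its pass count is strictly between 0 and
    `runs_per_scenario`.
    """
    passes = defaultdict(int)
    for scenario_id, passed in results:
        passes[scenario_id] += int(passed)
    return sorted(sid for sid, n in passes.items() if 0 < n < runs_per_scenario)

results = [
    ("happy_path", True), ("happy_path", True), ("happy_path", True),
    ("tool_error", True), ("tool_error", False), ("tool_error", True),
    ("cost_bound", False), ("cost_bound", False), ("cost_bound", False),
]
print(flaky_scenarios(results, runs_per_scenario=3))  # ['tool_error']
```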
Structured judges
Define judge criteria in YAML with pass/fail definitions and few-shot examples. Reuse specs across scenarios for consistent grading.
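A judge spec might look like the following. The field names here are illustrative, not kensa's actual schema; consult the docs for the real one:

```yaml
# judge-specs/answers_with_citation.yaml (hypothetical schema)
criterion: answers_with_citation
pass: The response cites at least one source retrieved during the run.
fail: The response asserts facts without citing any retrieved source.
examples:
  - output: "Per the 2023 filing [doc 4], revenue grew 12%."
    verdict: pass
  - output: "Revenue grew 12%."
    verdict: fail
```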
No platform
uv or pip install, BYO API keys, all data stays local. Same CLI on your laptop and in CI.
Skills
Five skills take you from zero to eval, or from traces to targeted iteration.
/audit-evals
Assess readiness, identify testable behaviors, prepare the environment. The default entry point.
/generate-scenarios
Happy paths, edge cases, tool usage, error handling, cost bounds. One command.
/generate-judges
Binary pass/fail definitions with few-shot examples, ready to reuse across scenarios.
/validate-judge
Test judge accuracy against human labels. Iterates until TPR and TNR meet threshold.
/diagnose-errors
Categorize failures, identify patterns, recommend next action.
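The two rates /validate-judge targets are standard: true positive rate (the judge passes what humans pass) and true negative rate (the judge fails what humans fail). A minimal sketch with hypothetical names:

```python
def judge_accuracy(human_labels, judge_verdicts):
    """TPR and TNR of a judge versus human labels.

    Both inputs are parallel lists of booleans (True = pass).
    """
    tp = sum(h and j for h, j in zip(human_labels, judge_verdicts))
    tn = sum(not h and not j for h, j in zip(human_labels, judge_verdicts))
    tpr = tp / sum(human_labels)                  # agreement on human "pass"
    tnr = tn / sum(not h for h in human_labels)   # agreement on human "fail"
    return tpr, tnr

human = [True, True, True, False, False]
judge = [True, True, False, False, True]
tpr, tnr = judge_accuracy(human, judge)
print(round(tpr, 3), tnr)  # 0.667 0.5
```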
CLI (Python 3.10+)
Works standalone for CI and local iteration. Checks run before the judge, so obvious failures stop early without spending tokens.
kensa init
Scaffold with an example agent
kensa eval
run + judge + report in one shot
kensa run
Execute scenarios, capture traces
kensa judge
Deterministic checks + LLM judge
kensa report
Terminal, markdown, JSON, or HTML output
kensa analyze
Cost/latency stats + anomaly flagging
kensa doctor
Pre-flight environment checks
FAQ
What agents does kensa work with?
Any Python agent that makes LLM calls. Auto-instrumentation covers Anthropic, OpenAI, and LangChain out of the box. Other providers work with manual OTel config.
Do I need to modify my agent code?
Two lines: from kensa import instrument; instrument(). Add them before your SDK imports; kensa runs your agent in a subprocess and captures traces automatically. Coding agents add this instrumentation for you.
Can I run kensa in CI?
Yes. kensa eval --format markdown is all you need. Deterministic checks need no API keys. Add judge keys as secrets for LLM-judged criteria.
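A minimal CI job might look like this. The workflow assumes GitHub Actions and that the package installs as kensa; the secret name is an example, so adapt both to your pipeline:

```yaml
# .github/workflows/evals.yml (illustrative)
jobs:
  evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install kensa
      - run: kensa eval --format markdown >> "$GITHUB_STEP_SUMMARY"
        env:
          # Only needed for LLM-judged criteria;
          # deterministic checks run without any API key.
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
```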
Is kensa free?
Yes, it is MIT licensed. The only cost is the LLM API calls for judge criteria, and even those are optional.