Changelog - kensa

0.8.0 — 2026-05-05

Kensa now runs as a pytest plugin, so you can write evals as ordinary tests and run them in your existing test suite.

New @pytest.mark.kensa(...) marker binds a pytest test to a kensa scenario, with cases= for parametrized inputs and trials= for repeated runs against the same case. Recorded outputs are captured and fed through the same checks and judge pipeline as YAML scenarios.
New kensa eval --pytest <path> invokes the bundled pytest plugin and writes a normal kensa run manifest, so reports, judge, and analyze work unchanged.
New kensa init --pytest scaffolds a starter pytest-native eval under tests/evals/.
Scenario schema gained cases and trials as the canonical keys; dataset and input_field remain as legacy aliases.
Fixed cost backfill missing models whose slug ends in a single-digit version segment (e.g. claude-sonnet-4), which previously fell through the slug normalizer.

0.7.0 — 2026-05-01

kensa capture records one real agent invocation, and kensa generate synthesizes scenarios straight from the capture, so you can bootstrap evals without writing any scenarios by hand.

New kensa capture -- <cmd> command runs your agent once with full instrumentation and writes a capture-kind run manifest plus a JSONL trace under .kensa/. Pass -i/--input to mirror scenario.input. kensa run rejects capture-kind manifests, so captures and evals stay separate.
kensa generate is now capture-aware: source priority is --trace → --run-id → latest capture manifest → latest run manifest. Generated scenarios inherit the observed run_command from the manifest verbatim.
kensa init now prompts for a coding agent (-a/--agent claude|codex|cursor|opencode|gemini|other|all|none) and installs the bundled skills into the right directory for that agent. Use --no-cli to skip the uv add --dev kensa step.
Internal: a new RunKind discriminator on the run manifest distinguishes eval runs from capture runs end to end, so the MCP server, report, and judge surfaces only operate on eval runs.
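
The source priority for kensa generate can be pictured as a first-match lookup. The sketch below is illustrative only: pick_source and its parameter names are hypothetical and not part of kensa; only the ordering (--trace, then --run-id, then the latest capture manifest, then the latest run manifest) comes from the entry above.

```python
from typing import Optional, Tuple


def pick_source(
    trace: Optional[str],
    run_id: Optional[str],
    latest_capture: Optional[str],
    latest_run: Optional[str],
) -> Tuple[str, str]:
    """Return (kind, value) for the first source that is available.

    Hypothetical helper: the names are made up for illustration; only
    the priority order mirrors the changelog entry.
    """
    candidates = [
        ("trace", trace),                      # explicit --trace file
        ("run-id", run_id),                    # explicit --run-id
        ("capture-manifest", latest_capture),  # newest capture manifest
        ("run-manifest", latest_run),          # newest eval run manifest
    ]
    for kind, value in candidates:
        if value is not None:
            return kind, value
    raise LookupError("no trace, run id, or manifest available")
```

With no explicit flags, a capture manifest wins over an eval run manifest, which is what makes the command capture-aware.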

0.6.2 — 2026-04-27

uvx kensa init is now a one-shot bootstrap: scaffold scenarios, add kensa as a dev dep, and drop the Claude Code skills into the project.

New kensa skills install command copies the bundled skills into .claude/skills/ (Claude Code) and .agents/skills/ (Codex, OpenCode, Cursor, and other adopters of the Agent Skills standard). Use --global to install into ~, --claude / --codex to scope to one target, and --force to overwrite existing files.
kensa init gained --cli / --skills flags (and their negations). In an interactive terminal, each step prompts before mutating state. In CI, both default to skip unless passed explicitly.
When kensa init adds kensa via uv add --dev, it now points at uv run kensa doctor if the active interpreter is outside the project venv, so doctor checks reflect the right environment.

0.6.1 — 2026-04-24

Release tooling fix.

Fixed uv.lock drifting from the bumped package version on release. The release script now refreshes the lockfile, and a packaging test guards against future drift in the kensa-mcp shim’s pin.

0.6.0 — 2026-04-24

kensa generate synthesizes new scenarios from real traces, so coverage grows with usage.

New kensa generate command replays the latest run (or a specific --run-id / --trace file) through an LLM and writes fresh scenario YAML to .kensa/scenarios/. Use -n to set the count (1–20), --dry-run to preview, --model to override the LLM, and --force to overwrite existing files.
Fixed the generator shipping invalid scenarios: every synthesized scenario is now validated against the runtime schema and -n is enforced.
Fixed the generator silently returning fewer scenarios than requested: underproduction now surfaces as a warning.
Fixed OpenAI judge verdicts truncating mid-response on reasoning models by switching to max_completion_tokens.
Fixed the MCP scenarios resource URI returning a 404 for clients that followed the documented path.

0.5.2 — 2026-04-18

Cost backfill now recognizes the full range of model slugs SDKs report.

Pricing lookups normalize SDK-reported model IDs against OpenRouter’s canonical dotted slugs, handling provider prefixes, dashed variants, and dated suffixes. Size segments like 70b, 24b, and 405b are left untouched.
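
To make the normalization concrete, here is a rough sketch of the kind of rewriting described above. It is not kensa's normalizer: normalize_slug and both regexes are assumptions written for illustration, and this version simply drops the provider prefix, which may differ from how the real lookup matches OpenRouter slugs.

```python
import re

# Trailing date stamps such as -20241022 or -2024-08-06 (assumed formats).
_DATE_SUFFIX = re.compile(r"-(\d{8}|\d{4}-\d{2}-\d{2})$")
# A dashed version such as 3-5, only when the second number ends the
# segment, so 3-70b is not mistaken for version 3.70.
_DASHED_VERSION = re.compile(r"(?<=-)(\d{1,2})-(\d{1,2})(?=-|$)")


def normalize_slug(model_id: str) -> str:
    """Hypothetical normalizer: prefix, date suffix, then dashed version."""
    slug = model_id.strip().lower()
    slug = slug.rsplit("/", 1)[-1]              # drop a provider prefix
    slug = _DATE_SUFFIX.sub("", slug)           # drop a dated suffix
    slug = _DASHED_VERSION.sub(r"\1.\2", slug)  # 3-5 becomes 3.5
    return slug
```

Under these assumptions, anthropic/claude-3-5-sonnet-20241022 becomes claude-3.5-sonnet, while llama-3-70b-instruct is returned unchanged because 70b is a size segment rather than a version number.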

0.5.1 — 2026-04-17

Instrumentation is zero-config: agents run without any code changes.

The runner injects a bootstrap sitecustomize.py via PYTHONPATH, so OpenTelemetry and SDK auto-instrumentation are set up before agent code runs. No more from kensa import instrument; instrument() boilerplate in scenario files.
instrument() stays exported as an idempotent escape hatch for environments where sitecustomize can’t run (e.g. python -S). Existing agents that still call it keep working, with no duplicate spans.
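
The mechanism relies on standard Python behavior: at startup the site module imports a module named sitecustomize if one is importable, and directories on PYTHONPATH are searched for it. The sketch below demonstrates that mechanism in isolation. It is not kensa's runner; run_with_bootstrap and the BOOTSTRAPPED variable are made up for the demo, and the bootstrap file only sets an environment variable where the real one would configure instrumentation.

```python
import os
import subprocess
import sys
import tempfile
from pathlib import Path


def run_with_bootstrap(agent_code: str) -> str:
    """Run agent_code in a child interpreter with a bootstrap on PYTHONPATH."""
    with tempfile.TemporaryDirectory() as tmp:
        # Stand-in bootstrap: marks the process instead of instrumenting it.
        Path(tmp, "sitecustomize.py").write_text(
            'import os\nos.environ["BOOTSTRAPPED"] = "1"\n'
        )
        env = dict(os.environ)
        existing = env.get("PYTHONPATH")
        # Prepend so this sitecustomize shadows any other one on the path.
        env["PYTHONPATH"] = tmp + (os.pathsep + existing if existing else "")
        result = subprocess.run(
            [sys.executable, "-c", agent_code],
            env=env,
            capture_output=True,
            text=True,
            check=True,
        )
        return result.stdout.strip()
```

The child process sees the marker without importing anything itself. Running the child with python -S skips the site module, which is the case the instrument() escape hatch covers.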

0.5.0 — 2026-04-15

Kensa now runs as an MCP server, so any MCP-aware client can drive the full eval workflow as tools.

New kensa mcp subcommand serves the harness over the Model Context Protocol, exposing init, doctor, run, judge, eval, report, and analyze as tools, plus eight kensa:// resources. Stdio by default, --http --port for HTTP transport.
Separate kensa-mcp PyPI shim lets you run uvx kensa-mcp without installing kensa first. The shim pins to the matching kensa[mcp] version and prints a clean install hint if the mcp extra is missing.
MCP errors come back as a stable MCPError(error, code, hint) envelope instead of raising across the protocol boundary, and doctor now distinguishes scenario-not-found from invalid-run-id.
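
The envelope idea, returning a structured error instead of letting an exception cross the protocol boundary, can be sketched as follows. This is not the kensa implementation: ErrorEnvelope stands in for MCPError, and the field types, the safe_call wrapper, the exception mapping, and the hint strings are all assumptions. Only the three field names and the two failure categories come from the entry above.

```python
from dataclasses import asdict, dataclass
from typing import Any, Callable, Dict


@dataclass(frozen=True)
class ErrorEnvelope:
    """Stand-in for MCPError(error, code, hint); str fields are assumed."""
    error: str
    code: str
    hint: str


def safe_call(tool: Callable[..., Any], *args: Any) -> Dict[str, Any]:
    """Call a tool and convert known failures into an envelope dict."""
    try:
        return {"result": tool(*args)}
    except KeyError as exc:
        envelope = ErrorEnvelope(
            error=f"unknown scenario: {exc.args[0]}",
            code="scenario-not-found",
            hint="check the scenario name",
        )
    except ValueError as exc:
        envelope = ErrorEnvelope(
            error=str(exc),
            code="invalid-run-id",
            hint="pass a run id from an earlier run",
        )
    return asdict(envelope)


def load_scenario(name: str) -> str:
    """Toy tool for the demo: raises KeyError for unknown names."""
    scenarios = {"refund": "refund scenario body"}
    return scenarios[name]
```

A client can then branch on the code field rather than parsing exception text, which is what makes the envelope stable across releases.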

0.4.0 — 2026-04-12

Trajectory checks let you assert that an agent followed the right tool-call path.

New trajectory check type validates tool-call sequences against expected patterns. It supports strict ordering and any-order matching, with optional accuracy thresholds and inline budgets.
Aggregate reports now include estimated k-run pass rates per scenario, so you can spot flaky evals without guessing.
Fixed trajectory placeholder validation rejecting valid unordered sequences.
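
The changelog does not say how the k-run pass rate is estimated. One common reading, shown here purely as an assumption, treats each run as an independent trial with the observed pass rate p and estimates the chance that all k runs pass as p to the power k. The function name and signature are hypothetical.

```python
def k_run_pass_rate(passes: int, trials: int, k: int) -> float:
    """Estimate the probability that k independent runs all pass.

    Assumes independent runs and uses the observed rate passes / trials;
    kensa's actual estimator may differ.
    """
    if trials <= 0 or not 0 <= passes <= trials:
        raise ValueError("need trials > 0 and 0 <= passes <= trials")
    if k < 1:
        raise ValueError("k must be at least 1")
    return (passes / trials) ** k
```

Under this reading, a scenario that passes 9 of 10 trials looks healthy on a single run but has only about a 59% chance of passing five runs in a row, which is the kind of flakiness the report is meant to surface.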

0.3.0 — 2026-04-10

Tool checks now accept lists, so you can assert multiple tools in one check.

Breaking: tool_called and tool_not_called are now tools_called and tools_not_called. They take a list of tool names with set-membership semantics (order-free). Use tool_order when sequence matters.
Validation errors now tell you which item in a scenario is invalid, not just that something is wrong.
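
The difference between set-membership checks and ordered checks can be sketched in a few lines. These functions only borrow the check names. The exact matching rules are assumptions: tools_called is read here as "every listed tool was called at least once", and tool_order as "the listed tools appear in this order, possibly with other calls in between".

```python
from typing import Iterable, Sequence


def tools_called(expected: Iterable[str], actual: Sequence[str]) -> bool:
    """Every expected tool appears somewhere in the trace (order-free)."""
    return set(expected) <= set(actual)


def tools_not_called(forbidden: Iterable[str], actual: Sequence[str]) -> bool:
    """None of the forbidden tools appear in the trace."""
    return set(forbidden).isdisjoint(actual)


def tool_order(expected: Sequence[str], actual: Sequence[str]) -> bool:
    """Expected tools occur as an in-order subsequence of the trace."""
    remaining = iter(actual)
    # "name in remaining" consumes the iterator up to the first match,
    # so each later name must be found after the previous one.
    return all(name in remaining for name in expected)
```

For a trace of search, fetch, summarize, the list [summarize, search] satisfies tools_called but fails tool_order, which is why the entry points to tool_order when sequence matters.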

0.2.0 — 2026-04-08

Run commands are safer — no more shell interpolation of inputs.

Breaking: run_command now takes an argv list instead of a shell string with {{input}} templates. Input is appended as the final argument. This removes the command-injection surface from the old shlex.quote approach.
Omitted input fields and explicit empty strings are now handled as distinct cases.
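
Why an argv list is safer can be shown with the standard library alone. This is a minimal sketch, not kensa's runner: build_argv and run_agent are hypothetical names, and the only behavior taken from the entry above is that the input is appended as the final argument of an argv list.

```python
import subprocess
import sys
from typing import List, Sequence


def build_argv(run_command: Sequence[str], user_input: str) -> List[str]:
    """Append the input as one final argument; no templating, no quoting."""
    return [*run_command, user_input]


def run_agent(run_command: Sequence[str], user_input: str) -> str:
    """Run the command without a shell and return its stdout."""
    result = subprocess.run(
        build_argv(run_command, user_input),
        capture_output=True,
        text=True,
        check=True,  # shell=False is the default, so nothing is interpolated
    )
    return result.stdout


if __name__ == "__main__":
    hostile = "hello; echo pwned $(whoami)"
    echo_last_arg = [sys.executable, "-c", "import sys; print(sys.argv[-1])"]
    # The child receives the text verbatim as a single argument.
    print(run_agent(echo_last_arg, hostile))
```

Because no shell ever parses the string, metacharacters such as ; and $( ) reach the agent as literal text, so there is nothing left for quoting to get wrong.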

0.1.0 — 2026-04-07

Initial harness release.

Each scenario runs in its own subprocess with KENSA_TRACE_DIR set. Add from kensa import instrument; instrument() to your agent — that’s the only code change.
Auto-instruments Anthropic, OpenAI, and LangChain SDKs via OpenTelemetry. Writes tool calls, token counts, and latency as JSONL spans.
Deterministic checks: output_contains, output_not_contains, tools_called, tools_not_called, tool_order, cost_threshold. Checks run before the judge — if a check fails, no tokens are spent.
LLM judge with Anthropic and OpenAI providers. Auto-resolves from whichever API key is set.
Reports in four formats: terminal, markdown, JSON, HTML.
kensa analyze computes multi-run variance, flags flaky scenarios, and reports cost/latency anomalies.
Dataset mode: point dataset at a JSONL file, and each row becomes a run.
kensa doctor validates environment, dependencies, API keys, and scenario files.
Five Claude Code skills: audit-evals, generate-scenarios, generate-judges, validate-judge, diagnose-errors.
Five example agents: code reviewer, customer support, incident triage, SDR qualifier, SQL analyst.
