# Scenarios

YAML-based test scenarios that define what to run and how to judge it.

Scenarios are YAML files in `.kensa/scenarios/`. Your coding agent generates these, but you can also write them by hand.
## Full example
```yaml
id: classify_ticket
name: Support ticket triage
description: Classify a support ticket by severity.
source: user  # code | traces | user
input: "Our entire team can't log in. SSO has returned 502 since 7am."
run_command: python agent.py {{input}}
expected_outcome: Agent returns the correct priority label.
checks:
  - type: output_matches
    params: { pattern: "^P[123]$" }
    description: Output must be exactly P1, P2, or P3.
  - type: max_cost
    params: { max_usd: 0.05 }
    description: Stay under five cents.
criteria: |
  P1 is for outages or data loss affecting multiple users.
  The agent must classify based on business impact, not tone.
```
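The `run_command` contract is simple: the scenario's `input` is interpolated into the command, and the agent's final answer is whatever the process writes to stdout (here it must match `^P[123]$`). As a minimal sketch of what such an `agent.py` might look like — the keyword heuristic is purely illustrative; a real agent would call an LLM:

```python
import sys

def classify(ticket: str) -> str:
    """Return a priority label P1-P3. Naive keyword heuristic,
    standing in for a real LLM-backed agent."""
    text = ticket.lower()
    if "can't log in" in text or "outage" in text or "data loss" in text:
        return "P1"  # multi-user impact: outage or data loss
    if "slow" in text or "error" in text:
        return "P2"  # degraded but working
    return "P3"      # everything else

if __name__ == "__main__":
    # {{input}} arrives as command-line arguments.
    print(classify(" ".join(sys.argv[1:])))
```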
## Fields
| Field | Required | Description |
|---|---|---|
| id | Yes | Unique identifier |
| name | Yes | Human-readable name |
| description | No | What this scenario tests |
| source | No | How it was generated: code, traces, or user |
| input | Yes | The prompt or input to the agent |
| run_command | Yes | Shell command to execute. {{input}} is interpolated |
| env_overrides | No | Extra environment variables for this scenario's subprocess |
| dataset | No | Path to a JSONL file for parameterized inputs (requires input_field) |
| input_field | No | JSONL field to use as the input (required when dataset is set) |
| expected_outcome | No | Natural-language description of success |
| checks | No | List of deterministic checks |
| criteria | No | Natural-language criteria for the LLM judge (mutually exclusive with judge) |
| judge | No | Reference to a judge spec in .kensa/judges/ (mutually exclusive with criteria) |
| trace_refs | No | Paths to previous trace files for context |
| failure_pattern | No | Known failure pattern this scenario targets |
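The optional fields compose with the required ones. A hedged sketch — the id, values, and judge name below are all hypothetical, chosen only to show where each field sits:

```yaml
id: refund_edge_case          # hypothetical scenario
name: Refund outside the policy window
source: traces
input: "I bought this 45 days ago and want my money back."
run_command: python agent.py {{input}}
env_overrides:
  REFUND_POLICY_DAYS: "30"    # injected into this scenario's subprocess only
judge: strict_policy_judge    # a judge spec in .kensa/judges/ (so no criteria field)
trace_refs:
  - traces/refund_fail_example.json
failure_pattern: agent approves refunds past the policy window
```

Note that `judge` and `criteria` are mutually exclusive: use `criteria` for inline judging instructions, or `judge` to reference a shared spec.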
## Checks vs criteria
Checks are deterministic and free. Use them for objective, binary conditions:
- Was a specific tool called?
- Did the agent stay under budget?
- Did it complete in fewer than N turns?
Criteria are evaluated by the LLM judge and cost tokens. Use them for subjective or nuanced conditions:
- Did the agent confirm before taking action?
- Was the response professional in tone?
- Did the agent avoid hallucinating details?
Checks run first. If any check fails, criteria are skipped (fail-fast).
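The fail-fast ordering can be sketched in a few lines. This is an illustrative model, not Kensa's actual runner API — `run_check` and `evaluate` are hypothetical names, and only two check types are dispatched:

```python
def run_check(check, result):
    # Illustrative dispatch for two of the check types shown above;
    # a real runner would cover the rest (tool_called, output_matches, ...).
    if check["type"] == "max_cost":
        return result["cost_usd"] <= check["params"]["max_usd"]
    if check["type"] == "max_turns":
        return result["turns"] <= check["params"]["max"]
    raise ValueError(f"unknown check type: {check['type']}")

def evaluate(checks, result, judge):
    # Fail-fast: deterministic (free) checks run first; the LLM judge,
    # which costs tokens, only runs if every check passes.
    for check in checks:
        if not run_check(check, result):
            return {"passed": False, "failed_check": check["type"], "judge_ran": False}
    return {"passed": judge(result), "failed_check": None, "judge_ran": True}
```

This is why it pays to express anything objective as a check: a budget overrun short-circuits the run before any judge tokens are spent.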
## Dataset-driven scenarios
Point at a JSONL file where each row becomes a separate run. Both dataset and input_field are required:
```yaml
id: booking_variations
name: Booking across routes
dataset: data/routes.jsonl
input_field: query
run_command: python agent.py {{input}}
checks:
  - type: tool_called
    params: { name: search_flights }
  - type: max_turns
    params: { max: 5 }
criteria: |
  The agent must confirm with the user before booking.
  The final answer must include a confirmation number.
```
The input_field value names the JSONL field that becomes the scenario input; other fields in each row can be referenced in check params via {{...}} placeholders. Re-run dataset scenarios to collect variance statistics, detect flaky behavior, and flag anomalies across rows.
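A dataset matching the example above might look like this — the rows and field names other than `query` are hypothetical:

```jsonl
{"query": "Book me a flight from SFO to JFK next Friday", "origin": "SFO", "dest": "JFK"}
{"query": "Find the cheapest route from Berlin to Lisbon", "origin": "BER", "dest": "LIS"}
```

Here `query` is interpolated as {{input}}, and a check param could reference the extra fields (for example {{origin}}) to assert per-row expectations.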