Scenarios

YAML-based test scenarios that define what to run and how to judge it.

Scenarios are YAML files in .kensa/scenarios/. Your coding agent generates these, but you can write them by hand.

Full example

id: classify_ticket
name: Support ticket triage
description: Classify a support ticket by severity.
source: user                            # code | traces | user

input: "Our entire team can't log in. SSO has returned 502 since 7am."
run_command: python agent.py {{input}}

expected_outcome: Agent returns the correct priority label.

checks:
  - type: output_matches
    params: { pattern: "^P[123]$" }
    description: Output must be exactly P1, P2, or P3.
  - type: max_cost
    params: { max_usd: 0.05 }
    description: Stay under five cents.

criteria: |
  P1 is for outages or data loss affecting multiple users.
  The agent must classify based on business impact, not tone.
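The run_command above hands the scenario input to the agent as a shell argument. As a minimal sketch of a compatible agent.py (hypothetical; a keyword heuristic stands in for your real agent, and the entry-point convention is an assumption), the only contract it must satisfy is: input arrives as argv[1], and the priority label is printed to stdout so output_matches can see it:

```python
import sys

def classify(ticket: str) -> str:
    """Toy severity classifier: keyword heuristics standing in for a real agent."""
    text = ticket.lower()
    if any(kw in text for kw in ("outage", "data loss", "can't log in", "down")):
        return "P1"  # multi-user impact
    if any(kw in text for kw in ("slow", "degraded", "intermittent")):
        return "P2"
    return "P3"

if __name__ == "__main__" and len(sys.argv) > 1:
    # {{input}} is interpolated into the command line, so it arrives as argv[1].
    print(classify(sys.argv[1]))
```

Whatever the agent prints to stdout is what the checks and the judge evaluate, so the final label should be the only thing on the last line.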

Fields

Field             Required  Description
id                Yes       Unique identifier.
name              Yes       Human-readable name.
description       No        What this scenario tests.
source            No        How it was generated: code, traces, or user.
input             Yes*      The prompt or input to the agent.
run_command       Yes       Shell command to execute; {{input}} is interpolated.
env_overrides     No        Extra environment variables for this scenario's subprocess.
dataset           No        Path to a JSONL file for parameterized inputs (requires input_field).
input_field       No        JSONL field to use as the input (required when dataset is set).
expected_outcome  No        Natural-language description of success.
checks            No        List of deterministic checks.
criteria          No        Natural-language criteria for the LLM judge (mutually exclusive with judge).
judge             No        Reference to a judge spec in .kensa/judges/ (mutually exclusive with criteria).
trace_refs        No        Paths to previous trace files for context.
failure_pattern   No        Known failure pattern this scenario targets.

* In dataset-driven scenarios, input is omitted; each JSONL row supplies it via input_field.

Checks vs criteria

Checks are deterministic and free. Use them for objective, binary conditions:

  • Was a specific tool called?
  • Did the agent stay under budget?
  • Did it complete in fewer than N turns?

Criteria are evaluated by the LLM judge and cost tokens. Use them for subjective or nuanced conditions:

  • Did the agent confirm before taking action?
  • Was the response professional in tone?
  • Did the agent avoid hallucinating details?

Checks run first. If any check fails, criteria are skipped (fail-fast).
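The fail-fast ordering can be sketched as follows (hypothetical helper names; only the two check types from the full example are handled, and the judge is passed in as a callable since its internals aren't specified here):

```python
import re

def run_checks(output: str, cost_usd: float, checks: list[dict]) -> bool:
    """Evaluate deterministic checks in order; True only if every check passes."""
    for check in checks:
        params = check["params"]
        if check["type"] == "output_matches":
            ok = re.search(params["pattern"], output) is not None
        elif check["type"] == "max_cost":
            ok = cost_usd <= params["max_usd"]
        else:
            raise ValueError(f"unknown check type: {check['type']}")
        if not ok:
            return False
    return True

def evaluate(output, cost_usd, checks, judge_fn, criteria):
    # Fail fast: free deterministic checks gate the token-costing LLM judge.
    if not run_checks(output, cost_usd, checks):
        return "fail"
    return judge_fn(output, criteria)
```

The design point is the gate in evaluate: the judge call never happens unless every check passed, so objective failures cost nothing.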

Dataset-driven scenarios

Point at a JSONL file where each row becomes a separate run. Both dataset and input_field are required:

id: booking_variations
name: Booking across routes
dataset: data/routes.jsonl
input_field: query
run_command: python agent.py {{input}}

checks:
  - type: tool_called
    params: { name: search_flights }
  - type: max_turns
    params: { max: 5 }

criteria: |
  The agent must confirm with the user before booking.
  The final answer must include a confirmation number.

The input_field specifies which JSONL field becomes the scenario input. Other row fields can be referenced in check params via {{...}} placeholders. Re-run the scenario to collect variance statistics, detect flaky runs, and flag anomalies.
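The row-expansion and placeholder semantics might look like this sketch (assumed behavior inferred from the example above; the inline rows are illustrative, and real shell quoting is ignored for brevity):

```python
import json
import re

def expand_dataset(jsonl_text: str, input_field: str, run_command: str):
    """Yield one (command, row) pair per JSONL line.

    {{field}} placeholders are filled from the row, with {{input}}
    aliased to the row's input_field value.
    """
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        row = json.loads(line)
        values = dict(row)
        values["input"] = row[input_field]  # {{input}} -> the designated field
        command = re.sub(
            r"\{\{(\w+)\}\}",
            lambda m: str(values[m.group(1)]),
            run_command,
        )
        yield command, row

# Two hypothetical rows of data/routes.jsonl:
rows = (
    '{"query": "SFO to JFK on Friday", "route": "SFO-JFK"}\n'
    '{"query": "Book LHR to CDG", "route": "LHR-CDG"}'
)
commands = [cmd for cmd, _ in expand_dataset(rows, "query", "python agent.py {{input}}")]
```

Here each JSONL row yields one concrete command, e.g. the first row produces `python agent.py SFO to JFK on Friday`, and the full row stays available for check params that reference fields like {{route}}.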