Scenarios

YAML-based test scenarios that define what to run and how to judge it.

Scenarios are YAML files in .kensa/scenarios/. Your coding agent generates these, but you can write them by hand.

Full example

id: classify_ticket
name: Support ticket triage
description: Classify a support ticket by severity.
source: user                            # code | traces | user

input: "Our entire team can't log in. SSO has returned 502 since 7am."
run_command: python agent.py {{input}}

expected_outcome: Agent returns the correct priority label.

checks:
  - type: output_matches
    params: { pattern: "^P[123]$" }
    description: Output must be exactly P1, P2, or P3.
  - type: max_cost
    params: { max_usd: 0.05 }
    description: Stay under five cents.

criteria: |
  P1 is for outages or data loss affecting multiple users.
  The agent must classify based on business impact, not tone.
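The run_command above hands the scenario input to the agent as a shell argument. As a minimal sketch of a compatible agent.py (hypothetical; a keyword heuristic stands in for your real agent, and the entry-point convention is an assumption), the only contract it must satisfy is: input arrives as argv[1], and the priority label is printed to stdout so output_matches can see it:

```python
import sys

def classify(ticket: str) -> str:
    """Toy severity classifier: keyword heuristics standing in for a real agent."""
    text = ticket.lower()
    if any(kw in text for kw in ("outage", "data loss", "can't log in", "down")):
        return "P1"  # multi-user impact
    if any(kw in text for kw in ("slow", "degraded", "intermittent")):
        return "P2"
    return "P3"

if __name__ == "__main__" and len(sys.argv) > 1:
    # {{input}} is interpolated into the command line, so it arrives as argv[1].
    print(classify(sys.argv[1]))
```

Whatever the agent prints to stdout is what the checks and the judge evaluate, so the final label should be the only thing on the last line.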

Fields

Field             Required  Description
id                Yes       Unique identifier.
name              Yes       Human-readable name.
description       No        What this scenario tests.
source            No        How it was generated: code, traces, or user.
input             Yes*      The prompt or input to the agent.
run_command       Yes       Shell command to execute; {{input}} is interpolated.
env_overrides     No        Extra environment variables for this scenario's subprocess.
dataset           No        Path to a JSONL file for parameterized inputs (requires input_field).
input_field       No        JSONL field to use as the input (required when dataset is set).
expected_outcome  No        Natural-language description of success.
checks            No        List of deterministic checks.
criteria          No        Natural-language criteria for the LLM judge (mutually exclusive with judge).
judge             No        Reference to a judge spec in .kensa/judges/ (mutually exclusive with criteria).
trace_refs        No        Paths to previous trace files for context.
failure_pattern   No        Known failure pattern this scenario targets.

* In dataset-driven scenarios, input is omitted; each JSONL row supplies it via input_field.

Checks vs criteria

Checks are deterministic and free. Use them for objective, binary conditions:

  • Was a specific tool called?
  • Did the agent stay under budget?
  • Did it complete in fewer than N turns?

Criteria are evaluated by the LLM judge and cost tokens. Use them for subjective or nuanced conditions:

  • Did the agent confirm before taking action?
  • Was the response professional in tone?
  • Did the agent avoid hallucinating details?

Checks run first. If any check fails, criteria are skipped (fail-fast).
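The fail-fast ordering can be sketched as follows (hypothetical helper names; only the two check types from the full example are handled, and the judge is passed in as a callable since its internals aren't specified here):

```python
import re

def run_checks(output: str, cost_usd: float, checks: list[dict]) -> bool:
    """Evaluate deterministic checks in order; True only if every check passes."""
    for check in checks:
        params = check["params"]
        if check["type"] == "output_matches":
            ok = re.search(params["pattern"], output) is not None
        elif check["type"] == "max_cost":
            ok = cost_usd <= params["max_usd"]
        else:
            raise ValueError(f"unknown check type: {check['type']}")
        if not ok:
            return False
    return True

def evaluate(output, cost_usd, checks, judge_fn, criteria):
    # Fail fast: free deterministic checks gate the token-costing LLM judge.
    if not run_checks(output, cost_usd, checks):
        return "fail"
    return judge_fn(output, criteria)
```

The design point is the gate in evaluate: the judge call never happens unless every check passed, so objective failures cost nothing.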

Dataset-driven scenarios

Point at a JSONL file where each row becomes a separate run. Both dataset and input_field are required:

id: booking_variations
name: Booking across routes
dataset: data/routes.jsonl
input_field: query
run_command: python agent.py {{input}}

checks:
  - type: tool_called
    params: { name: search_flights }
  - type: max_turns
    params: { max: 5 }

criteria: |
  The agent must confirm with the user before booking.
  The final answer must include a confirmation number.

The input_field specifies which JSONL field becomes the scenario input. Other row fields can be referenced in check params via {{...}} placeholders. Re-run the scenario to collect variance statistics, detect flaky runs, and flag anomalies.
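The row-expansion and placeholder semantics might look like this sketch (assumed behavior inferred from the example above; the inline rows are illustrative, and real shell quoting is ignored for brevity):

```python
import json
import re

def expand_dataset(jsonl_text: str, input_field: str, run_command: str):
    """Yield one (command, row) pair per JSONL line.

    {{field}} placeholders are filled from the row, with {{input}}
    aliased to the row's input_field value.
    """
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        row = json.loads(line)
        values = dict(row)
        values["input"] = row[input_field]  # {{input}} -> the designated field
        command = re.sub(
            r"\{\{(\w+)\}\}",
            lambda m: str(values[m.group(1)]),
            run_command,
        )
        yield command, row

# Two hypothetical rows of data/routes.jsonl:
rows = (
    '{"query": "SFO to JFK on Friday", "route": "SFO-JFK"}\n'
    '{"query": "Book LHR to CDG", "route": "LHR-CDG"}'
)
commands = [cmd for cmd, _ in expand_dataset(rows, "query", "python agent.py {{input}}")]
```

Here each JSONL row yields one concrete command, e.g. the first row produces `python agent.py SFO to JFK on Friday`, and the full row stays available for check params that reference fields like {{route}}.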