Checks

Checks run before the LLM judge to save cost. If any check fails, the judge is skipped (fail-fast). A scenario passes only when all checks pass AND the judge passes.
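The gating rule above can be sketched in a few lines of Python. This is an illustrative sketch, not the project's runner: `CheckOutcome`, `run_scenario`, and the callable-based interface are hypothetical names, and it assumes every check is evaluated before deciding whether to call the judge.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class CheckOutcome:
    """Hypothetical stand-in for a single check's result."""
    name: str
    passed: bool


def run_scenario(
    checks: list[Callable[[], CheckOutcome]],
    judge: Callable[[], bool],
) -> tuple[bool, bool]:
    """Return (scenario_passed, judge_was_called)."""
    # Assumption: all checks are evaluated first, since they are cheap.
    outcomes = [check() for check in checks]
    if not all(outcome.passed for outcome in outcomes):
        return False, False  # fail-fast: the judge call is skipped
    # All checks passed, so the judge decides the final verdict.
    return judge(), True


def failing_judge() -> bool:
    raise AssertionError("judge must not run when a check fails")


# One failing check: the judge is never invoked.
result_fail = run_scenario(
    [lambda: CheckOutcome("max_turns", False)], failing_judge
)
# All checks pass: the scenario verdict is the judge's verdict.
result_pass = run_scenario(
    [lambda: CheckOutcome("max_turns", True)], lambda: True
)
```

Here `result_fail` reports a failed scenario without a judge call, while `result_pass` passes only because both the checks and the judge pass.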

Check types

| Check | What it tests |
| --- | --- |
| `output_contains` | Output includes a string or pattern |
| `output_matches` | Output matches a regex |
| `tools_called` | All listed tools were invoked (set membership, order-free) |
| `tools_not_called` | None of the listed tools were invoked |
| `tool_order` | Tools called in this temporal sequence (use only when order is load-bearing) |
| `trajectory` | Match the expected tool-call path, optionally with accuracy threshold and inline budgets |
| `max_cost` | Total cost under threshold |
| `max_turns` | LLM call count under limit |
| `max_duration` | Execution time under limit |
| `no_repeat_calls` | No duplicate tool calls with identical arguments |

Examples

Output checks

```yaml
checks:
  # String containment (case-insensitive by default)
  - type: output_contains
    params: { value: "confirmation number" }

  # Case-sensitive containment
  - type: output_contains
    params: { value: "OK", case_sensitive: true }

  # Regex match
  - type: output_matches
    params: { pattern: "\\d{6,}" }
    description: Output contains a 6+ digit number
```

Tool checks

```yaml
checks:
  # Tools were called (set membership, order-free)
  - type: tools_called
    params: { tools: [search_flights] }

  # Tools were NOT called (safety check)
  - type: tools_not_called
    params: { tools: [delete_account] }
    description: Agent must never call delete

  # Tools called in order
  - type: tool_order
    params: { order: [search_flights, book_flight] }
    description: Must search before booking

  # Canonical tool-call trajectory with optional budgets
  - type: trajectory
    params:
      steps:
        - tool: search_flights
        - tool: book_flight
      ordering: exact
      args: ignore
      min_accuracy: 1.0
      max_steps: 2
      max_tokens: 2000
      max_duration_seconds: 30
    description: Search, then book, within budget

  # No duplicate calls (trace-wide; flags any tool called twice with the same args)
  - type: no_repeat_calls
    description: Agent should not redo identical work
```

`trajectory` is the higher-level path check for tool correctness. It emits `trajectory_accuracy` and `step_efficiency` metrics in reports. In V1, each scenario may contain at most one `trajectory` check.
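The difference between the simpler tool checks (`tools_called`, `tools_not_called`, `tool_order`, `no_repeat_calls`) is easiest to see side by side. The sketch below is illustrative only: the signatures are hypothetical, tool calls are reduced to plain names or `(name, args)` pairs, and it assumes `tool_order` allows other calls in between the listed tools.

```python
import json


def tools_called(calls: list[str], tools: list[str]) -> bool:
    """Set membership: every listed tool appears somewhere, in any order."""
    return set(tools) <= set(calls)


def tools_not_called(calls: list[str], tools: list[str]) -> bool:
    """Safety check: none of the listed tools appears."""
    return set(tools).isdisjoint(calls)


def tool_order(calls: list[str], order: list[str]) -> bool:
    """Assumption: `order` must be a subsequence of `calls` (gaps allowed)."""
    remaining = iter(calls)
    # `tool in remaining` advances the iterator, so matches must occur in order.
    return all(tool in remaining for tool in order)


def no_repeat_calls(calls: list[tuple[str, dict]]) -> bool:
    """Fails if any tool is called twice with identical arguments."""
    seen: set[tuple[str, str]] = set()
    for name, args in calls:
        key = (name, json.dumps(args, sort_keys=True))  # hashable fingerprint
        if key in seen:
            return False
        seen.add(key)
    return True


trace = ["search_flights", "get_weather", "book_flight"]
in_any_order = tools_called(trace, ["book_flight", "search_flights"])
in_sequence = tool_order(trace, ["search_flights", "book_flight"])
reversed_sequence = tool_order(trace, ["book_flight", "search_flights"])
```

With this trace, the set-membership check passes for either ordering of the listed tools, while the reversed sequence fails `tool_order`. That is why `tool_order` is worth using only when the order itself matters.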

Resource checks

```yaml
checks:
  # Cost cap
  - type: max_cost
    params: { max_usd: 0.10 }
    description: Under 10 cents

  # Turn limit
  - type: max_turns
    params: { max: 5 }
    description: Complete in 5 LLM calls

  # Time limit
  - type: max_duration
    params: { max_seconds: 30 }
    description: Under 30 seconds
```
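As a rough illustration of how the three resource limits combine, the sketch below evaluates them against a summary of one run. `RunStats` and `resource_failures` are hypothetical names, and the comparison treats each limit as inclusive (a run that uses exactly 5 LLM calls passes `max: 5`), which is an assumption the source does not confirm.

```python
from dataclasses import dataclass


@dataclass
class RunStats:
    """Hypothetical summary of a single scenario run."""
    cost_usd: float
    llm_calls: int
    duration_seconds: float


def resource_failures(
    stats: RunStats, max_usd: float, max_turns: int, max_seconds: float
) -> list[str]:
    """Return the names of the resource checks that the run violates."""
    failures = []
    # Assumption: limits are inclusive, so only values above the limit fail.
    if stats.cost_usd > max_usd:
        failures.append("max_cost")
    if stats.llm_calls > max_turns:
        failures.append("max_turns")
    if stats.duration_seconds > max_seconds:
        failures.append("max_duration")
    return failures


within_budget = resource_failures(RunStats(0.08, 5, 12.0), 0.10, 5, 30)
over_budget = resource_failures(RunStats(0.25, 7, 12.0), 0.10, 5, 30)
```

In this example `within_budget` is empty, and `over_budget` names the cost and turn limits but not the duration limit.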

Adding a check

Checks use a registry pattern. To add a new check type:

1. Add a value to `CheckType` in `models.py`
2. Write a check function in `checks.py`
3. Register it in `CHECK_REGISTRY`

```python
# checks.py
def check_my_check(spans: list[Span], params: dict[str, Any]) -> CheckResult:
    # Your logic here
    return CheckResult(check="my_check", passed=True, detail="...")


CHECK_REGISTRY: dict[CheckType, CheckFn] = {
    # ...existing checks...
    CheckType.MY_CHECK: check_my_check,
}
```

No call-site changes needed. The registry handles dispatch.
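To show why no call site has to change, here is a self-contained sketch of registry dispatch. The `Span`, `CheckResult`, `CheckType`, and `CheckFn` definitions are simplified stand-ins for the real ones in `models.py` and `checks.py`, and `run_check` is a hypothetical dispatcher, not the project's actual entry point.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Callable

Span = dict[str, Any]  # stand-in: the real Span type lives in the project


class CheckType(str, Enum):
    MY_CHECK = "my_check"


@dataclass
class CheckResult:
    check: str
    passed: bool
    detail: str = ""


CheckFn = Callable[[list[Span], dict[str, Any]], CheckResult]


def check_my_check(spans: list[Span], params: dict[str, Any]) -> CheckResult:
    # Example logic only: pass when the trace recorded at least one span.
    return CheckResult(
        check="my_check", passed=len(spans) > 0, detail=f"{len(spans)} spans"
    )


CHECK_REGISTRY: dict[CheckType, CheckFn] = {
    CheckType.MY_CHECK: check_my_check,
}


def run_check(type_name: str, spans: list[Span], params: dict[str, Any]) -> CheckResult:
    """Hypothetical dispatcher: look the function up by type, then call it."""
    check_fn = CHECK_REGISTRY[CheckType(type_name)]
    return check_fn(spans, params)


result = run_check("my_check", [{"name": "llm_call"}], {})
```

Because `run_check` only consults the registry, a new check becomes available as soon as its enum value and registry entry exist. An unknown type name raises `ValueError` from the `CheckType(...)` lookup in this sketch.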
