AgentBenchmark — Systematic Testing¶

Test your agents systematically with expected outputs, tool usage checks, and detailed pass/fail reports.

Quick start¶

from shipit_agent.deep import AgentBenchmark, TestCase

benchmark = AgentBenchmark(
    name="knowledge-eval",
    cases=[
        TestCase(input="What is Python?", expected_contains=["programming"]),
        TestCase(input="What is Docker?", expected_contains=["container"]),
        TestCase(input="Explain REST", expected_contains=["http"], expected_not_contains=["graphql"]),
        TestCase(input="Search for news", expected_tools=["web_search"]),
    ],
)

report = benchmark.run(agent)
print(report.summary())

Output¶

Agent Benchmark: knowledge-eval
Cases: 4 passed, 0 failed (4 total)
Pass rate: 100%
Avg iterations: 1.2
  [PASS] What is Python?
  [PASS] What is Docker?
  [PASS] Explain REST
  [PASS] Search for news

With retry (for Bedrock rate limits)¶

report = benchmark.run(agent, retry=3, delay=2.0)

TestCase options¶

Field	Description
`input`	Prompt sent to agent
`expected_contains`	Output must contain these words
`expected_not_contains`	Output must NOT contain these
`expected_tools`	These tools must be used

BenchmarkReport¶

Property	Description
`passed` / `failed` / `total`	Counts
`pass_rate`	0.0 - 1.0
`summary()`	Human-readable report
`to_dict()`	JSON export for dashboards

Notebook

notebooks/15_agent_benchmark.ipynb