AgentBenchmark: Systematic Testing
Test your agents systematically with expected outputs, tool usage checks, and detailed pass/fail reports.
Quick start
```python
from shipit_agent.deep import AgentBenchmark, TestCase

benchmark = AgentBenchmark(
    name="knowledge-eval",
    cases=[
        TestCase(input="What is Python?", expected_contains=["programming"]),
        TestCase(input="What is Docker?", expected_contains=["container"]),
        TestCase(input="Explain REST", expected_contains=["http"], expected_not_contains=["graphql"]),
        TestCase(input="Search for news", expected_tools=["web_search"]),
    ],
)

# `agent` is an agent instance you have already constructed.
report = benchmark.run(agent)
print(report.summary())
```
Output
```text
Agent Benchmark: knowledge-eval
Cases: 4 passed, 0 failed (4 total)
Pass rate: 100%
Avg iterations: 1.2
[PASS] What is Python?
[PASS] What is Docker?
[PASS] Explain REST
[PASS] Search for news
```
With retry (for Bedrock rate limits)
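One library-agnostic way to ride out rate limits is to wrap the whole benchmark run in a retry loop with exponential backoff. The sketch below makes several assumptions: `run_with_retry` and its parameters are hypothetical helpers, not part of shipit_agent, and it does not rely on any retry option the library itself may offer. Which exception a throttled call raises depends on how the agent's LLM client surfaces it, so the exception types are left as the `retry_on` parameter.

```python
import random
import time


def run_with_retry(run, *, attempts=4, base_delay=1.0,
                   retry_on=(Exception,), sleep=time.sleep):
    """Call run() until it succeeds, backing off exponentially between failures.

    Hypothetical helper: not part of shipit_agent.
    """
    for attempt in range(1, attempts + 1):
        try:
            return run()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            # Delay doubles each attempt (1x, 2x, 4x, ...) plus up to 10% jitter.
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay * 0.1))


# Hypothetical usage with the benchmark from the quick start:
# report = run_with_retry(lambda: benchmark.run(agent), attempts=5, base_delay=2.0)
```

Retrying the full run repeats cases that already passed, so this suits small suites; narrowing `retry_on` to the throttling exception avoids masking genuine failures.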
TestCase options
| Field | Description |
|---|---|
| `input` | Prompt sent to the agent |
| `expected_contains` | Words the output must contain |
| `expected_not_contains` | Words the output must NOT contain |
| `expected_tools` | Tools the agent must use |
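To make the semantics of these fields concrete, here is a minimal sketch of how such checks could be evaluated against an agent's output. It is not the library's implementation: `SketchCase` and `check_case` are hypothetical stand-ins, and case-insensitive substring matching is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class SketchCase:
    """Hypothetical stand-in mirroring the TestCase fields in the table above."""
    input: str
    expected_contains: list[str] = field(default_factory=list)
    expected_not_contains: list[str] = field(default_factory=list)
    expected_tools: list[str] = field(default_factory=list)


def check_case(case, output, tools_used):
    """Return a list of failure messages; an empty list means the case passes."""
    text = output.lower()  # assumption: matching ignores case
    failures = []
    for needle in case.expected_contains:
        if needle.lower() not in text:
            failures.append(f"missing: {needle!r}")
    for needle in case.expected_not_contains:
        if needle.lower() in text:
            failures.append(f"forbidden: {needle!r}")
    for tool in case.expected_tools:
        if tool not in tools_used:
            failures.append(f"tool not used: {tool!r}")
    return failures


case = SketchCase(
    input="Explain REST",
    expected_contains=["http"],
    expected_not_contains=["graphql"],
)
print(check_case(case, "REST is an architectural style built on HTTP.", tools_used=[]))
# → []
print(check_case(case, "REST, unlike GraphQL, exposes resources.", tools_used=[]))
# → ["missing: 'http'", "forbidden: 'graphql'"]
```

Returning failure messages rather than a bare boolean makes it easier to see which expectation caused a case to fail.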
BenchmarkReport
| Property | Description |
|---|---|
| `passed` / `failed` / `total` | Case counts |
| `pass_rate` | Fraction of cases passed, from 0.0 to 1.0 |
| `summary()` | Human-readable report |
| `to_dict()` | JSON export for dashboards |
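A report like this can also gate a CI job or feed a dashboard. The sketch below uses a hypothetical `StubReport` that mimics only the properties listed above, so it runs without the library. The `to_dict()` key names, the `gate` helper and the 0.9 threshold are all assumptions for illustration.

```python
import json
from dataclasses import dataclass


@dataclass
class StubReport:
    """Hypothetical stand-in exposing only the properties listed above."""
    passed: int
    failed: int

    @property
    def total(self):
        return self.passed + self.failed

    @property
    def pass_rate(self):
        # Fraction in [0.0, 1.0]; an empty run is reported as 0.0 here (assumption).
        return self.passed / self.total if self.total else 0.0

    def to_dict(self):
        # Key names are assumptions, not the library's documented schema.
        return {
            "passed": self.passed,
            "failed": self.failed,
            "total": self.total,
            "pass_rate": self.pass_rate,
        }


def gate(report, min_pass_rate=0.9):
    """Return a process exit code: 0 if the pass rate meets the threshold, else 1."""
    return 0 if report.pass_rate >= min_pass_rate else 1


report = StubReport(passed=3, failed=1)
print(json.dumps(report.to_dict()))
# → {"passed": 3, "failed": 1, "total": 4, "pass_rate": 0.75}
print(gate(report))
# → 1
```

With a real report, the same pattern would be `sys.exit(gate(benchmark.run(agent)))` at the end of a CI script, assuming `pass_rate` behaves as described in the table.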
Notebook
notebooks/15_agent_benchmark.ipynb