Error Recovery & Retry Policies¶

SHIPIT Agent handles failures gracefully at every level — LLM provider errors, tool execution failures, and hallucinated tool names all produce recoverable error messages instead of crashing the agent run.

How error recovery works¶

When a tool fails after exhausting retries, the runtime produces an error ToolResult message and sends it back to the LLM. The LLM sees the error and can decide to try a different tool, adjust its approach, or report the issue to the user.

LLM: "Call web_search with query='latest news'"
         │
         ▼
    web_search raises ConnectionError
         │
    retry 1 → still fails
         │
         ▼
    Runtime creates error message:
    "Error running tool 'web_search': connection refused"
         │
         ▼
    LLM sees error, decides to try open_url instead
         │
         ▼
    Agent continues running

This is the same pattern used for hallucinated tool names — every tool call gets a paired result message, whether success or error, keeping the conversation balanced for all providers (especially Bedrock).

Retry policy¶

from shipit_agent import Agent, RetryPolicy

agent = Agent(
    llm=llm,
    retry_policy=RetryPolicy(
        max_llm_retries=2,           # retry LLM calls up to 2 times
        max_tool_retries=1,          # retry tool calls up to 1 time
        retry_on_exceptions=(        # only retry these exception types
            ConnectionError,
            TimeoutError,
            OSError,
        ),
    ),
)

Default exceptions¶

The default retry_on_exceptions is (ConnectionError, TimeoutError, OSError) — network and I/O errors that are typically transient. This is intentionally narrow:

Exception type	Retried by default	Why
`ConnectionError`	Yes	Network hiccup, retry likely succeeds
`TimeoutError`	Yes	Server slow, retry may succeed
`OSError`	Yes	I/O issue, often transient
`RuntimeError`	No	Usually a bug, retrying won't help
`ValueError`	No	Bad data, same input = same error
`TypeError`	No	Code bug, fix the code
`KeyError`	No	Missing data, not transient

To retry on additional exceptions:

RetryPolicy(
    retry_on_exceptions=(ConnectionError, TimeoutError, OSError, RuntimeError),
)

Events emitted during failures¶

Event	When	Key payload
`tool_retry`	Tool failed, retrying	`attempt`, `error`, `iteration`
`tool_failed`	Tool failed permanently (or hallucinated name)	`error`, `iteration`
`llm_retry`	LLM call failed, retrying	`attempt`, `error`

Before vs. after (the old behavior)¶

Scenario	Before (v1.0.0)	After (v1.0.2)
Tool raises after retries	Agent crashes, caller gets exception	Error message sent to LLM, agent continues
Hallucinated tool name	Error message sent to LLM	Error message sent to LLM (unchanged)
LLM provider error	Retried, then crashes	Retried, then crashes (unchanged)

Breaking change from 1.0.0

If you were catching tool exceptions from agent.run(), note that tool failures no longer propagate as exceptions. The agent will continue running and include the error in its response. Check result.events for tool_failed events if you need to detect failures programmatically.