Context Window Management

Long-running agents can exhaust the LLM's context window as tool results accumulate. SHIPIT Agent provides token usage tracking and automatic message compaction to handle this gracefully.

Token usage tracking

Every LLMResponse carries a usage dict with the token counts reported by the provider:

from shipit_agent import Agent
from shipit_agent.llms import OpenAIChatLLM

agent = Agent.with_builtins(llm=OpenAIChatLLM(model="gpt-4o-mini"))

for event in agent.stream("Research quantum computing"):
    if event.type == "run_completed":
        usage = event.payload.get("usage", {})
        print(f"Prompt tokens:     {usage.get('prompt_tokens', 0)}")
        print(f"Completion tokens: {usage.get('completion_tokens', 0)}")
        print(f"Total tokens:      {usage.get('total_tokens', 0)}")

Usage is accumulated across all iterations of the agent loop and reported in the run_completed event.
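
To make the accumulation concrete, here is a minimal standalone sketch of summing per-call usage dicts into a run total. The helper name `accumulate_usage` and the sample numbers are hypothetical; this is not the runtime's actual code, only an illustration of the behaviour described above.

```python
USAGE_FIELDS = ("prompt_tokens", "completion_tokens", "total_tokens")


def accumulate_usage(per_call_usage):
    """Sum token counts over a sequence of per-call usage dicts."""
    totals = {name: 0 for name in USAGE_FIELDS}
    for usage in per_call_usage:
        for name in USAGE_FIELDS:
            # Missing fields (e.g. an empty usage dict) count as zero.
            totals[name] += usage.get(name, 0)
    return totals


# Three hypothetical LLM calls made during one agent run.
calls = [
    {"prompt_tokens": 120, "completion_tokens": 30, "total_tokens": 150},
    {"prompt_tokens": 480, "completion_tokens": 55, "total_tokens": 535},
    {},  # a call whose provider reported no usage
]
run_usage = accumulate_usage(calls)
print(run_usage)
```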

Automatic message compaction

When context_window_tokens is set to a positive value, the runtime automatically compacts older messages once the estimated token count exceeds 75% of that limit:

agent = Agent.with_builtins(
    llm=OpenAIChatLLM(model="gpt-4o-mini"),
    context_window_tokens=128000,  # gpt-4o-mini's context window
)

How compaction works

1. Before each LLM call, the runtime estimates the token count of all messages.
2. If the estimate exceeds 75% of context_window_tokens, older messages are compacted.
3. System messages and the 4 most recent messages are always preserved.
4. Older tool results are condensed into a single summary message.
5. The compacted message is marked with metadata={"compacted": True}.

Before compaction:                  After compaction:
─────────────────                   ─────────────────

system: "You are helpful"           system: "You are helpful"
user: "Research X"                  user (compacted): "[web_search]: Top 3..
assistant: "Let me search"                            [open_url]: Page con.."
tool[web_search]: "Top 3 results.." assistant: "Based on the results..."
tool[open_url]: "Page content..."   tool[code_exec]: "Output: 42"
assistant: "Based on the results.." user: "Now analyze the data"
tool[code_exec]: "Output: 42"
user: "Now analyze the data"
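
The compaction steps above can be sketched in a few lines of standalone Python. Everything here is an assumption made for illustration: the `Message` class, the `compact` function, and the 4-characters-per-token estimate are hypothetical, and the choice to drop older user and assistant messages follows the diagram rather than any confirmed runtime behaviour.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    role: str                 # "system", "user", "assistant", or "tool"
    content: str
    name: str = ""            # tool name when role == "tool"
    metadata: dict = field(default_factory=dict)


def estimate_tokens(messages):
    # Assumption: roughly 4 characters per token; a real tokenizer will differ.
    return sum(len(m.content) for m in messages) // 4


def compact(messages, context_window_tokens, threshold=0.75,
            keep_recent=4, snippet_chars=40):
    if context_window_tokens <= 0:
        return messages  # 0 disables compaction
    if estimate_tokens(messages) <= threshold * context_window_tokens:
        return messages  # still under the 75% mark

    system = [m for m in messages if m.role == "system"]
    rest = [m for m in messages if m.role != "system"]
    if len(rest) <= keep_recent:
        return messages
    older, recent = rest[:-keep_recent], rest[-keep_recent:]

    # Condense older tool results into one truncated summary message.
    lines = [f"[{m.name}]: {m.content[:snippet_chars]}"
             for m in older if m.role == "tool"]
    summary = []
    if lines:
        summary = [Message("user", "\n".join(lines),
                           metadata={"compacted": True})]
    return system + summary + recent


history = [
    Message("system", "You are helpful"),
    Message("user", "Research X"),
    Message("assistant", "Let me search"),
    Message("tool", "Top 3 results " + "x" * 400, name="web_search"),
    Message("tool", "Page content " + "y" * 400, name="open_url"),
    Message("assistant", "Based on the results"),
    Message("tool", "Output: 42", name="code_exec"),
    Message("user", "Now analyze the data"),
]
compacted = compact(history, context_window_tokens=200)
for m in compacted:
    print(m.role, m.metadata, m.content[:30])
```

With a window of 200 tokens the estimate crosses the 75% mark, so the three oldest non-system messages are replaced by one summary while the system prompt and the last four messages pass through unchanged.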

When to set it

| Model | Suggested `context_window_tokens` |
| --- | --- |
| GPT-4o / GPT-4o-mini | `128000` |
| Claude 3.5 / Claude Opus 4 | `200000` |
| Gemini 1.5 Pro | `1000000` |
| Llama 3.1 70B | `128000` |
| Bedrock gpt-oss-120b | `128000` |

Set context_window_tokens to 0 (the default) to disable compaction entirely.
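
As a quick sanity check on what a given setting implies, the 75% trigger point can be worked out directly. This is a small arithmetic sketch; the helper `compaction_threshold` and the dictionary keys are hypothetical labels, and only the window sizes and the 75% ratio come from this page.

```python
COMPACTION_RATIO = 0.75  # the 75% trigger described on this page

# Window sizes from the suggested settings; the keys are illustrative labels.
SUGGESTED_WINDOWS = {
    "gpt-4o-mini": 128_000,
    "claude-3.5": 200_000,
    "gemini-1.5-pro": 1_000_000,
}


def compaction_threshold(context_window_tokens):
    """Return the estimated-token count above which compaction would start."""
    if context_window_tokens <= 0:
        return None  # 0 means compaction is disabled
    return int(context_window_tokens * COMPACTION_RATIO)


for label, window in SUGGESTED_WINDOWS.items():
    print(f"{label}: compaction starts above {compaction_threshold(window)} tokens")
```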

Cost tracking with hooks

Combine usage tracking with hooks for detailed cost monitoring:

from shipit_agent import Agent, AgentHooks
from shipit_agent.llms import OpenAIChatLLM

# Rates in USD per 1M tokens
MODEL_COSTS = {
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
    "gpt-4o": {"input": 2.50, "output": 10.00},
}

hooks = AgentHooks()
costs = {"total_usd": 0.0}

@hooks.on_after_llm
def track_cost(response):
    # Look up the rates for this model; unknown models are counted at zero cost.
    model = response.metadata.get("model", "")
    rates = MODEL_COSTS.get(model, {"input": 0, "output": 0})
    # Rates are quoted per 1M tokens, so scale the token counts before multiplying.
    prompt_cost = response.usage.get("prompt_tokens", 0) / 1_000_000 * rates["input"]
    completion_cost = response.usage.get("completion_tokens", 0) / 1_000_000 * rates["output"]
    costs["total_usd"] += prompt_cost + completion_cost

agent = Agent.with_builtins(
    llm=OpenAIChatLLM(model="gpt-4o-mini"),
    hooks=hooks,
)
agent.run("Do something complex with multiple tool calls")
print(f"Total cost: ${costs['total_usd']:.4f}")
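
The arithmetic inside the hook can be checked in isolation with a worked example. The function `estimate_cost_usd` and the token counts below are hypothetical; the rates are the ones from the snippet above and may not match current provider pricing.

```python
# Rates in USD per 1M tokens, copied from the snippet above.
MODEL_COSTS = {
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
    "gpt-4o": {"input": 2.50, "output": 10.00},
}


def estimate_cost_usd(model, usage):
    """Convert a usage dict into an estimated USD cost for one model."""
    rates = MODEL_COSTS.get(model, {"input": 0, "output": 0})
    prompt_cost = usage.get("prompt_tokens", 0) / 1_000_000 * rates["input"]
    completion_cost = usage.get("completion_tokens", 0) / 1_000_000 * rates["output"]
    return prompt_cost + completion_cost


# Hypothetical run: 12,000 prompt tokens and 3,000 completion tokens.
usage = {"prompt_tokens": 12_000, "completion_tokens": 3_000, "total_tokens": 15_000}
mini_cost = estimate_cost_usd("gpt-4o-mini", usage)
full_cost = estimate_cost_usd("gpt-4o", usage)
print(f"gpt-4o-mini: ${mini_cost:.4f}")  # prints: gpt-4o-mini: $0.0036
print(f"gpt-4o:      ${full_cost:.4f}")  # prints: gpt-4o:      $0.0600
```

Note that a model missing from the rate table silently yields a cost of zero, so it may be worth logging unknown model names in a real hook.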

Provider support

| Provider | `usage` populated | Fields |
| --- | --- | --- |
| OpenAI | Yes | `prompt_tokens`, `completion_tokens`, `total_tokens` |
| Anthropic | Yes | `prompt_tokens`, `completion_tokens`, `total_tokens` |
| LiteLLM (Bedrock, Gemini, etc.) | Yes | `prompt_tokens`, `completion_tokens`, `total_tokens` |
| SimpleEchoLLM (dev/test) | No | Empty dict |
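
Because some LLMs return an empty usage dict, code that consumes usage may want to tell "no data" apart from "zero tokens". The sketch below shows one way to do that; `usage_or_none` is a hypothetical helper, and deriving `total_tokens` when it is absent is an assumption rather than documented behaviour.

```python
def usage_or_none(usage):
    """Return a normalized usage dict, or None when nothing was reported."""
    if not usage:
        return None  # e.g. a dev/test LLM that reports an empty dict
    prompt = usage.get("prompt_tokens", 0)
    completion = usage.get("completion_tokens", 0)
    return {
        "prompt_tokens": prompt,
        "completion_tokens": completion,
        # Assumption: fall back to the sum if the provider omits the total.
        "total_tokens": usage.get("total_tokens", prompt + completion),
    }


print(usage_or_none({}))
print(usage_or_none({"prompt_tokens": 10, "completion_tokens": 5}))
```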