Building production AI agents with stateful graph orchestration

Once an agent has more than two tools, the "ReAct loop" stops being an architecture and starts being a liability. Latency stacks, error modes multiply, and there is nowhere honest to draw a boundary for retries.

The shift we made — and the one I now recommend to every team I advise — is to stop modelling the agent as a loop and start modelling it as a state machine where every transition is named, every state is checkpointable, and every tool call is a node, not a side effect.

In LangGraph terms: the graph is the contract. The model is a participant in the graph, not its owner. That single inversion is what lets you do everything that an agent in production actually needs to do — pause, resume, fan out, hand off to a human, replay yesterday's session for a regression suite.

Why loops fail at scale

The canonical ReAct loop is elegant for demos: observe, think, act, repeat. In production it accumulates three failure modes that compound with every additional tool:

Unbounded retries. When a tool call fails, the loop has no natural place to draw a boundary. You add a max_steps guard. Then a team member raises it "just this once". Three months later your p99 latency is 45 seconds and you don't know why.

Non-deterministic replay. When a customer files a bug, you want to replay the exact sequence of decisions. A loop with no named checkpoints gives you a log. A graph gives you a resumable snapshot.

No human handoff. A loop cannot pause and wait. A graph can interrupt at any named node, wait indefinitely for human input, then resume from exactly that state.
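
To make the contrast concrete, here is a minimal sketch of a graph runner in plain Python, with no LangGraph involved. It assumes a node is a function that returns the name of the next node, that the retry budget is declared per node, and that a node can ask to pause by returning a sentinel. The names Snapshot, Node, run, PAUSE and the draft/approve/send nodes are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable

PAUSE = "__pause__"  # hypothetical sentinel: the node is waiting for a human
END = "__end__"      # hypothetical sentinel: terminal state


@dataclass
class Snapshot:
    """Everything needed to resume: the current node and the state so far."""
    node: str
    state: dict
    paused: bool = False


@dataclass
class Node:
    fn: Callable[[dict], str]  # mutates the state dict, returns the next node name
    max_attempts: int = 1      # the retry boundary is declared on the node


def run(nodes: dict[str, Node], snap: Snapshot) -> Snapshot:
    """Advance from the snapshot until the graph ends or a node pauses."""
    node_name, state = snap.node, dict(snap.state)
    while node_name != END:
        node = nodes[node_name]
        for attempt in range(1, node.max_attempts + 1):
            trial = dict(state)  # a failed attempt must not leak partial writes
            try:
                nxt = node.fn(trial)
            except Exception:
                if attempt == node.max_attempts:
                    raise  # budget exhausted: fail at a named node
                continue
            state = trial
            break
        if nxt == PAUSE:
            return Snapshot(node=node_name, state=state, paused=True)
        node_name = nxt
    return Snapshot(node=END, state=state)


calls = {"n": 0}


def draft(state: dict) -> str:
    calls["n"] += 1
    if calls["n"] < 3:  # simulate a tool that fails twice, then succeeds
        raise TimeoutError("tool timed out")
    state["draft"] = "refund issued"
    return "approve"


def approve(state: dict) -> str:
    if "approved" not in state:
        return PAUSE  # wait here until a human fills in the decision
    return "send" if state["approved"] else END


def send(state: dict) -> str:
    state["sent"] = True
    return END


nodes = {
    "draft": Node(draft, max_attempts=3),
    "approve": Node(approve),
    "send": Node(send),
}

snap = run(nodes, Snapshot(node="draft", state={}))  # stops at "approve"
snap.state["approved"] = True                        # the human decision arrives later
final = run(nodes, Snapshot(node=snap.node, state=snap.state))
```

The retry limit lives next to the node it protects, so raising it for one flaky tool does not raise it for the whole agent. The pause is only a return value plus a snapshot, so it can sit in storage for as long as the human needs.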

The graph topology

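# The only loop is the runtime's. The agent is a graph.
from langgraph.graph import StateGraph, START, END
from langgraph.checkpoint.postgres import PostgresSaver

graph = StateGraph(AgentState)

# Every unit of work is a named node.
graph.add_node("plan",     planner)
graph.add_node("retrieve", retriever)
graph.add_node("act",      tool_runner)
graph.add_node("reflect",  critic)

# Routers return the name of the next node. Attribute access assumes
# AgentState is a dataclass or Pydantic model rather than a TypedDict.
graph.add_edge(START, "plan")
graph.add_conditional_edges("plan",
    lambda s: "retrieve" if s.needs_context else "act")
graph.add_edge("retrieve", "act")
graph.add_conditional_edges("act",
    lambda s: END if s.done else "reflect")
graph.add_edge("reflect", "plan")

# from_conn_string is a context manager; setup() creates the saver's
# tables on first use. Invoke the app while the connection is open.
with PostgresSaver.from_conn_string(dsn) as checkpointer:
    checkpointer.setup()
    app = graph.compile(checkpointer=checkpointer)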
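
The topology reads needs_context and done from AgentState but never shows the state itself. A minimal sketch of what it could look like follows, assuming the state is a dataclass. The question and context fields are hypothetical, and the END string is a local stand-in so the sketch runs without LangGraph installed. The routers are written as named functions rather than lambdas so they can be unit-tested on their own.

```python
from dataclasses import dataclass, field

# Local stand-in for the graph's terminal marker, so this runs without the library.
END = "__end__"


@dataclass
class AgentState:
    question: str = ""                                # hypothetical field
    context: list[str] = field(default_factory=list)  # hypothetical field
    needs_context: bool = False                       # read by the router after "plan"
    done: bool = False                                # read by the router after "act"


def route_after_plan(state: AgentState) -> str:
    """Go through retrieval only when the planner asked for more context."""
    return "retrieve" if state.needs_context else "act"


def route_after_act(state: AgentState) -> str:
    """Finish when the tool runner marks the task done, otherwise reflect."""
    return END if state.done else "reflect"
```

Because each router is a pure function of the state, a regression test for a routing decision needs no model, no tools and no database.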

Checkpointing with Postgres

We use PostgresSaver as the checkpointer. Each transition writes a row to agent_checkpoints(thread_id, step, state_json, created_at). This gives us:

Resume on crash — the runner picks up from the last completed step
Human review — we can inspect any intermediate state
Regression testing — replay from any checkpoint with mocked tools
Cost attribution — each step row includes token counts

The Postgres write adds ~3ms per step. For a 20-step agent that's 60ms. Acceptable.
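
The row-per-transition idea can be sketched with the standard library. This is an illustration of the table layout described above, not PostgresSaver's internals or its actual schema. It uses sqlite3 as a stand-in for Postgres so it runs anywhere, and it assumes token counts are stored inside state_json under a hypothetical "tokens" key. The helper names are also hypothetical.

```python
import json
import sqlite3
from datetime import datetime, timezone

SCHEMA = """
CREATE TABLE IF NOT EXISTS agent_checkpoints (
    thread_id  TEXT    NOT NULL,
    step       INTEGER NOT NULL,
    state_json TEXT    NOT NULL,
    created_at TEXT    NOT NULL,
    PRIMARY KEY (thread_id, step)
)
"""


def save_checkpoint(conn, thread_id, step, state):
    """Write one row per completed transition."""
    conn.execute(
        "INSERT INTO agent_checkpoints VALUES (?, ?, ?, ?)",
        (thread_id, step, json.dumps(state),
         datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()


def load_latest(conn, thread_id):
    """Resume on crash: return (step, state) for the last completed step."""
    row = conn.execute(
        "SELECT step, state_json FROM agent_checkpoints "
        "WHERE thread_id = ? ORDER BY step DESC LIMIT 1",
        (thread_id,),
    ).fetchone()
    return (row[0], json.loads(row[1])) if row else (0, {})


def tokens_used(conn, thread_id):
    """Cost attribution: sum the token counts recorded at each step."""
    rows = conn.execute(
        "SELECT state_json FROM agent_checkpoints WHERE thread_id = ?",
        (thread_id,),
    )
    return sum(json.loads(r[0]).get("tokens", 0) for r in rows)


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
save_checkpoint(conn, "t-1", 1, {"node": "plan", "tokens": 120})
save_checkpoint(conn, "t-1", 2, {"node": "act", "tokens": 340})

last_step, last_state = load_latest(conn, "t-1")
total_tokens = tokens_used(conn, "t-1")
```

The composite primary key on (thread_id, step) means a runner that crashes and retries the same step fails loudly on the duplicate insert instead of silently forking the history.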

What you cannot checkpoint

Streaming tool outputs. If your tool streams data to the user mid-graph, you cannot safely replay that transition without re-triggering the stream. We solved this by marking streaming nodes as non_resumable and re-running them fresh on any replay.
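
One way to picture the non_resumable rule is a replay function that serves recorded outputs for ordinary steps and executes flagged steps fresh. The sketch below is plain Python under that assumption. Only the non_resumable flag comes from our setup; the Step and replay names and the example steps are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Step:
    name: str
    fn: Callable[[dict], Any]
    non_resumable: bool = False  # streaming nodes are never served from the recording


def replay(steps, recorded, state):
    """Re-drive a session from recorded outputs.

    Resumable steps reuse what was recorded. Steps marked non_resumable,
    or steps with no recording, are executed again.
    """
    executed = []
    for step in steps:
        if step.non_resumable or step.name not in recorded:
            output = step.fn(state)
            executed.append(step.name)
        else:
            output = recorded[step.name]
        state = {**state, step.name: output}
    return state, executed


steps = [
    Step("retrieve", lambda s: ["doc-live"]),
    Step("stream_answer",
         lambda s: f"streamed {len(s['retrieve'])} doc(s)",
         non_resumable=True),
]
recorded = {"retrieve": ["doc-a", "doc-b"], "stream_answer": "stale stream"}

final_state, executed = replay(steps, recorded, {})
```

In this run the retrieval result comes from the recording, while the streaming step runs again against that recorded input, so the stale stream is never presented as if it were live.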

Observability

Every graph execution emits a structured event per step: {thread_id, step_name, input_tokens, output_tokens, latency_ms, tool_calls, error}. We ship these to a time-series store and build dashboards per agent per week. When a new tool slows down the median step time, we see it in one day, not one sprint.
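
A per-step event of that shape can be produced by a thin wrapper around each node. The following is a sketch and not our production emitter. It assumes a step returns a dict that reports its own usage under input_tokens, output_tokens and tool_calls, and that emit is any callable that accepts a JSON string. The run_step name and the example steps are hypothetical.

```python
import json
import time

EVENT_FIELDS = ("thread_id", "step_name", "input_tokens", "output_tokens",
                "latency_ms", "tool_calls", "error")


def run_step(thread_id, step_name, fn, state, emit=print):
    """Run one step and emit exactly one structured event, success or failure."""
    event = {
        "thread_id": thread_id,
        "step_name": step_name,
        "input_tokens": 0,
        "output_tokens": 0,
        "latency_ms": 0.0,
        "tool_calls": [],
        "error": None,
    }
    started = time.perf_counter()
    try:
        result = fn(state)
        # Assumption: the step reports its own usage under these keys.
        event["input_tokens"] = result.get("input_tokens", 0)
        event["output_tokens"] = result.get("output_tokens", 0)
        event["tool_calls"] = result.get("tool_calls", [])
        return result
    except Exception as exc:
        event["error"] = f"{type(exc).__name__}: {exc}"
        raise
    finally:
        # Runs on both paths, so a failed step still gets a latency and an event.
        event["latency_ms"] = round((time.perf_counter() - started) * 1000, 3)
        emit(json.dumps(event))


def ok_step(state):
    return {"input_tokens": 12, "output_tokens": 30, "tool_calls": ["search"]}


def bad_step(state):
    raise ValueError("boom")


events = []
run_step("t-1", "act", ok_step, {}, emit=events.append)
try:
    run_step("t-1", "reflect", bad_step, {}, emit=events.append)
except ValueError:
    pass

parsed = [json.loads(e) for e in events]
```

Emitting from the finally block is the design choice that matters: failed steps are the ones a dashboard most needs to see, and they are the easiest to lose when the event is written only on the success path.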