LangGraph in Production: Stateful Agents That Survive Failures
- Author: Benjamin Lee
What LangGraph Actually Is
LangGraph is an open-source framework for building stateful, multi-actor applications with LLMs. Its core abstraction is a graph where nodes are functions that perform work, edges define control flow, and a typed state object flows through the entire execution. With a checkpointer configured, the framework provides durable execution: agents can persist through failures and resume from the last saved checkpoint rather than starting over.
This is a meaningful departure from simple chain-based agents. In a chain, state lives implicitly in the message history. In LangGraph, state is explicit, typed, and managed by the framework. That explicitness is what makes production deployment tractable.
As of October 2025, LangGraph Platform (now called LangSmith Deployment) has been used by nearly 400 companies to deploy agents into production.
The Core Concepts
State
Everything flows through a TypedDict that you define:
from typing import Annotated
from typing_extensions import TypedDict
from langgraph.graph.message import add_messages

class AgentState(TypedDict):
    messages: Annotated[list, add_messages]  # reducer: appends, doesn't overwrite
    steps_taken: int
    requires_review: bool
add_messages is a reducer: when two nodes both update messages, the values are merged rather than one overwriting the other. New messages are appended, and a message whose ID matches an existing one replaces it. You can define custom reducers for any field that needs merge semantics.
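To make the reducer idea concrete, here is a minimal, dependency-free sketch. The names merge_unique, ResearchState, and apply_update are hypothetical, and apply_update is only a toy stand-in for the framework's merge step, not LangGraph's actual code. It shows a reducer as a plain two-argument function attached through Annotated, and a field without a reducer falling back to last-write-wins.

```python
from typing import Annotated, TypedDict, get_type_hints


def merge_unique(existing: list, new: list) -> list:
    """Hypothetical custom reducer: append new items, skipping duplicates."""
    merged = list(existing)
    for item in new:
        if item not in merged:
            merged.append(item)
    return merged


class ResearchState(TypedDict):
    sources: Annotated[list, merge_unique]  # merged through the reducer
    steps_taken: int                        # no reducer: last write wins


def apply_update(state: dict, update: dict) -> dict:
    """Toy stand-in for the framework's merge step (an assumption, not LangGraph code)."""
    hints = get_type_hints(ResearchState, include_extras=True)
    merged = dict(state)
    for key, value in update.items():
        metadata = getattr(hints[key], "__metadata__", ())
        if metadata and key in merged:
            merged[key] = metadata[0](merged[key], value)  # call the reducer
        else:
            merged[key] = value                            # plain overwrite
    return merged


state = {"sources": ["a.pdf"], "steps_taken": 1}
state = apply_update(state, {"sources": ["a.pdf", "b.pdf"], "steps_taken": 2})
print(state)
```

In a real graph you would only write the reducer and the Annotated field; the framework calls the reducer whenever a node returns a value for that key.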
Nodes and Edges
Nodes are plain functions. They receive the full state and return a partial update:
from langchain_anthropic import ChatAnthropic
from langgraph.prebuilt import ToolNode

# `tools` is assumed to be a list of LangChain tools defined elsewhere
llm = ChatAnthropic(model="claude-sonnet-4-6").bind_tools(tools)

def call_model(state: AgentState) -> dict:
    response = llm.invoke(state["messages"])
    return {
        "messages": [response],
        # .get() avoids a KeyError when the caller did not seed steps_taken
        "steps_taken": state.get("steps_taken", 0) + 1,
    }

tool_node = ToolNode(tools)
Conditional edges route execution based on current state:
from langgraph.graph import StateGraph, END

def route(state: AgentState) -> str:
    # .get() tolerates runs whose input did not set requires_review
    if state.get("requires_review", False):
        return "human_review"
    last = state["messages"][-1]
    if getattr(last, "tool_calls", None):
        return "tools"
    return END

graph = StateGraph(AgentState)
graph.add_node("agent", call_model)
graph.add_node("tools", tool_node)
# A "human_review" node must also be registered before route can send runs to it
graph.set_entry_point("agent")
graph.add_conditional_edges("agent", route)
graph.add_edge("tools", "agent")
app = graph.compile()
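Because the router is a plain function of state, its branches can be checked without an LLM or a compiled graph. The sketch below is one way to do that under stated assumptions: END is redefined locally as a stand-in for langgraph.graph.END so the example has no dependencies, and the messages are fake objects that carry only the tool_calls attribute the router inspects.

```python
from types import SimpleNamespace

END = "__end__"  # local stand-in for langgraph.graph.END (assumption for this sketch)


def route(state: dict) -> str:
    """Same branching logic as the router above, written against a plain dict."""
    if state.get("requires_review", False):
        return "human_review"
    last = state["messages"][-1]
    if getattr(last, "tool_calls", None):
        return "tools"
    return END


# Fake messages: only the tool_calls attribute matters to the router
plain_reply = SimpleNamespace(content="Done.", tool_calls=[])
tool_request = SimpleNamespace(content="", tool_calls=[{"name": "search", "args": {}}])

assert route({"messages": [plain_reply]}) == END
assert route({"messages": [tool_request]}) == "tools"
# The review flag takes priority over a pending tool call
assert route({"messages": [tool_request], "requires_review": True}) == "human_review"
```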
Checkpointing: The Production-Critical Feature
Checkpointing is what separates a LangGraph agent from a demo agent. Every state transition is persisted to a backend, which means:
- An agent can run for hours or days and survive process restarts
- You can inspect the state of any in-flight agent at any time
- Failed runs can be resumed from the last successful checkpoint, not from scratch
LangGraph offers three checkpointer backends (the SQLite and Postgres savers are installed as separate packages):
| Backend | When to use |
|---|---|
| MemorySaver | Development and testing only; data disappears on restart |
| SqliteSaver | Single-server deployments, local persistence |
| PostgresSaver | Distributed systems, horizontal scaling |
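from langchain_core.messages import HumanMessage
from langgraph.checkpoint.postgres import PostgresSaver

DB_URI = "postgresql://user:pass@host:5432/langgraph"

# In recent langgraph-checkpoint-postgres releases, from_conn_string is a
# context manager that owns the database connection
with PostgresSaver.from_conn_string(DB_URI) as checkpointer:
    checkpointer.setup()  # creates the checkpoint tables on first use
    app = graph.compile(checkpointer=checkpointer)
    # Each run is keyed by thread_id; the same thread resumes the same conversation
    config = {"configurable": {"thread_id": "user-session-789"}}
    result = app.invoke({"messages": [HumanMessage("Summarize Q3 results")]}, config)
    # Resume after failure: passing None continues from the last checkpoint
    app.invoke(None, config)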
The thread ID is how LangGraph ties checkpoints to a specific conversation or task. Use a stable identifier — user ID, session ID, job ID — depending on your use case.
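The resume behavior is easier to reason about with a toy model. The sketch below is an illustration only, not how LangGraph stores checkpoints: the checkpoints dict, the run helper, and the fetch and summarize steps are all hypothetical. It saves a snapshot after each successful step, keyed by thread ID, so a second call with the same ID skips work that already completed.

```python
import copy

checkpoints = {}  # hypothetical in-memory store: thread_id -> snapshot


def run(thread_id: str, steps: list, initial: dict) -> dict:
    """Run steps in order, saving a snapshot after each one succeeds."""
    saved = checkpoints.get(thread_id, {"next_step": 0, "state": initial})
    state = copy.deepcopy(saved["state"])
    for index in range(saved["next_step"], len(steps)):
        state = steps[index](state)
        checkpoints[thread_id] = {"next_step": index + 1, "state": copy.deepcopy(state)}
    return state


calls = []
flaky = {"should_fail": True}


def fetch(state: dict) -> dict:
    calls.append("fetch")
    return {**state, "rows": 3}


def summarize(state: dict) -> dict:
    calls.append("summarize")
    if flaky["should_fail"]:
        raise RuntimeError("simulated crash")
    return {**state, "summary": f"{state['rows']} rows"}


try:
    run("job-42", [fetch, summarize], {"rows": 0})
except RuntimeError:
    flaky["should_fail"] = False  # pretend the outage is over

# Same thread ID: execution restarts at summarize, not at fetch
result = run("job-42", [fetch, summarize], {"rows": 0})
print(result)
print(calls)
```

A different thread ID would start from the initial state, which is why the choice of identifier decides what counts as "the same run".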
Human-in-the-Loop
LangGraph's interrupt mechanism pauses execution at a defined point and waits for external input before continuing. This is how you build approval flows, escalation paths, and review gates:
from langgraph.types import interrupt

def human_review_node(state: AgentState) -> dict:
    # Execution pauses here; the graph's state is saved in the checkpointer
    decision = interrupt({
        "question": "Agent wants to delete production data. Approve?",
        "context": state["messages"][-3:],
    })
    # "approved" must also be declared on AgentState (approved: bool)
    return {"requires_review": False, "approved": decision == "yes"}
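The graph resumes when you call app.invoke again with the same thread_id, passing the human's response as Command(resume=...) from langgraph.types. On resume the node runs again from its first line, and interrupt returns the supplied value instead of pausing, so keep any side effects after the interrupt call. No polling, no timeouts: the state just waits in the checkpointer until you pick it back up.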
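The pause-and-resume semantics can be modeled without any dependencies. The sketch below is a toy under stated assumptions: ToyRunner and the Interrupt exception are hypothetical stand-ins, not LangGraph classes. It mirrors one detail worth knowing about the real library, which is that a resumed node is re-executed from the top and the interrupt call then returns the human's answer.

```python
class Interrupt(Exception):
    """Raised to pause a node; carries the payload shown to the human."""

    def __init__(self, payload):
        super().__init__(payload)
        self.payload = payload


class ToyRunner:
    """Hypothetical stand-in for a compiled graph with a single node."""

    def __init__(self, node):
        self.node = node
        self.saved = {}       # thread_id -> state held while paused
        self._resume = None   # value handed to interrupt() on resume

    def interrupt(self, payload):
        if self._resume is not None:
            value, self._resume = self._resume, None
            return value              # resumed: hand back the human's answer
        raise Interrupt(payload)      # first pass: pause the node

    def invoke(self, state, thread_id, resume=None):
        if resume is not None:
            state = self.saved.pop(thread_id)  # reload the paused state
            self._resume = resume
        try:
            return self.node(state, self.interrupt)  # node runs from the top
        except Interrupt as pause:
            self.saved[thread_id] = state
            return {"__interrupt__": pause.payload}


def human_review(state, interrupt):
    decision = interrupt({"question": "Delete production data?"})
    return {**state, "approved": decision == "yes"}


runner = ToyRunner(human_review)
paused = runner.invoke({"requires_review": True}, thread_id="t1")
final = runner.invoke(None, thread_id="t1", resume="yes")
print(paused)
print(final)
```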
When to Use LangGraph vs. Simpler Alternatives
Per the LangChain docs, not every agentic use case needs LangGraph's full machinery:
| Scenario | Recommendation |
|---|---|
| Single-turn Q&A with tools | create_react_agent from langgraph.prebuilt |
| Multi-step with basic retry | create_react_agent with recursion_limit |
| Long-running with persistence | LangGraph + PostgresSaver |
| Human approval required | LangGraph + interrupt |
| Multiple coordinating agents | LangGraph multi-agent supervisor pattern |
Start with create_react_agent — it covers most agentic tasks with zero boilerplate. Reach for the graph API when you need explicit branching, persistence, or human gates.
Memory Across Sessions
LangGraph supports two memory scopes:
- Short-term (in-thread): the message history within a single thread_id. Managed automatically by the checkpointer.
- Long-term (cross-thread): facts that should persist across separate conversations, such as user preferences, past decisions, and learned context. Stored externally (e.g., a vector store or key-value store) and loaded into state at the start of each run.
For most production agents, you need both. Short-term memory is free with checkpointing. Long-term memory requires designing a retrieval step at graph entry.
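A retrieval step at graph entry can be as small as a node that looks up facts by user and returns them as a partial state update. The sketch below assumes a plain dict as the key-value store; LONG_TERM_STORE, load_memory, save_memory, and the user_id and user_facts fields are hypothetical names chosen for illustration, and a production system would swap in a real database or vector store.

```python
# Hypothetical long-term store: user_id -> remembered facts
LONG_TERM_STORE = {
    "user-123": ["prefers metric units", "works in the finance team"],
}


def load_memory(state: dict) -> dict:
    """Entry node: look up facts for this user and return a partial update."""
    facts = LONG_TERM_STORE.get(state["user_id"], [])
    return {"user_facts": list(facts)}  # copy so state never aliases the store


def save_memory(state: dict, fact: str) -> None:
    """Write a new fact so later threads can see it."""
    LONG_TERM_STORE.setdefault(state["user_id"], []).append(fact)


# First conversation: load what is known, then learn something new
thread_a = {"user_id": "user-123", "messages": []}
thread_a.update(load_memory(thread_a))
save_memory(thread_a, "wants weekly summaries")

# A separate conversation for the same user sees the new fact
thread_b = {"user_id": "user-123", "messages": []}
thread_b.update(load_memory(thread_b))
print(thread_b["user_facts"])
```

In a graph, load_memory would be the first node after the entry point, so every later node can read user_facts from state.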
The Bottom Line
LangGraph's value proposition is simple: it makes the control flow of your agent as explicit, inspectable, and testable as the rest of your application. The state is typed. The transitions are declared. The persistence is built in. That's what you need when agents run in production for real users on real tasks.