
LangGraph in Production: Stateful Agents That Survive Failures

Author: Benjamin Lee

What LangGraph Actually Is

LangGraph is an open-source framework for building stateful, multi-actor applications with LLMs. Its core abstraction is a graph where nodes are functions that perform work, edges define control flow, and a typed state object flows through the entire execution. The framework supports durable execution: once a checkpointer is configured, an agent's state is persisted as it runs, so a run that fails can resume from its last saved checkpoint instead of starting over.

This is a meaningful departure from simple chain-based agents. In a chain, state lives implicitly in the message history. In LangGraph, state is explicit, typed, and managed by the framework. That explicitness is what makes production deployment tractable.

As of October 2025, LangGraph Platform (now called LangSmith Deployment) has been used by nearly 400 companies to deploy agents into production.

The Core Concepts

State

Everything flows through a TypedDict that you define:

from typing import Annotated
from typing_extensions import TypedDict
from langgraph.graph.message import add_messages

class AgentState(TypedDict):
    messages: Annotated[list, add_messages]  # reducer: appends, doesn't overwrite
    steps_taken: int
    requires_review: bool

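add_messages is a reducer: when a node returns new messages, they are appended to the existing list instead of overwriting it (a message whose ID matches an existing one replaces it in place). The same merge applies when two nodes update messages in the same step. You can define custom reducers for any field that needs merge semantics.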
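As a rough illustration of a custom reducer, the sketch below assumes a hypothetical visited_urls field that should accumulate without duplicates. A reducer is an ordinary function that takes the current value and the update and returns the merged value; the names merge_unique and ResearchState are invented for this example and are not part of LangGraph.

```python
from typing import Annotated, TypedDict


def merge_unique(existing: list[str], new: list[str]) -> list[str]:
    """Reducer: append new items, skipping any that are already present."""
    merged = list(existing)
    for item in new:
        if item not in merged:
            merged.append(item)
    return merged


class ResearchState(TypedDict):
    # Hypothetical field: the reducer is attached through Annotated metadata
    visited_urls: Annotated[list[str], merge_unique]
    steps_taken: int  # no reducer, so the last write wins


# Conceptually, this is the merge applied when a node returns
# {"visited_urls": [...]} as a partial update.
current = ["https://example.com/a"]
update = ["https://example.com/a", "https://example.com/b"]
result = merge_unique(current, update)
print(result)
```

Keeping the reducer a pure function of its two arguments makes it easy to test without building a graph.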

Nodes and Edges

Nodes are plain functions. They receive the full state and return a partial update:

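from langchain_anthropic import ChatAnthropic
from langgraph.prebuilt import ToolNode

# `tools` is your list of LangChain tools, defined elsewhere
llm = ChatAnthropic(model="claude-sonnet-4-6").bind_tools(tools)

def call_model(state: AgentState) -> dict:
    response = llm.invoke(state["messages"])
    return {
        "messages": [response],  # appended by the add_messages reducer
        # .get() guards the first run, where steps_taken has not been set yet
        "steps_taken": state.get("steps_taken", 0) + 1,
    }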

tool_node = ToolNode(tools)

Conditional edges route execution based on current state:

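from langgraph.graph import StateGraph, END

def route(state: AgentState) -> str:
    # .get() treats a missing flag as "no review needed"
    if state.get("requires_review", False):
        return "human_review"
    last = state["messages"][-1]
    if getattr(last, "tool_calls", None):
        return "tools"
    return END

graph = StateGraph(AgentState)
graph.add_node("agent", call_model)
graph.add_node("tools", tool_node)
# human_review_node is defined in the Human-in-the-Loop section below
graph.add_node("human_review", human_review_node)
graph.set_entry_point("agent")
# The mapping declares every destination route() can return
graph.add_conditional_edges(
    "agent", route, {"human_review": "human_review", "tools": "tools", END: END}
)
graph.add_edge("tools", "agent")
# Assumption: control returns to the agent once the review is done
graph.add_edge("human_review", "agent")
app = graph.compile()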
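Because the router is a plain function of state, it can be exercised without compiling a graph or calling a model. The sketch below is one way to do that, assuming stub messages built with SimpleNamespace and a local END constant standing in for langgraph.graph.END so the snippet runs without LangGraph installed.

```python
from types import SimpleNamespace

# Stand-in for langgraph.graph.END so this sketch has no dependencies
END = "__end__"


def route(state: dict) -> str:
    """Same routing logic as the article's router, written against a plain dict."""
    if state.get("requires_review", False):
        return "human_review"
    last = state["messages"][-1]
    if getattr(last, "tool_calls", None):
        return "tools"
    return END


# Stub messages: only the tool_calls attribute matters to the router
with_tool_call = SimpleNamespace(tool_calls=[{"name": "search", "args": {}}])
without_tool_call = SimpleNamespace(tool_calls=[])

to_tools = route({"messages": [with_tool_call]})
to_end = route({"messages": [without_tool_call]})
to_review = route({"messages": [with_tool_call], "requires_review": True})
print(to_tools, to_end, to_review)
```

Note that the review flag takes priority over pending tool calls, which is the ordering you would want for a safety gate.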

Checkpointing: The Production-Critical Feature

Checkpointing is what separates a LangGraph agent from a demo agent. Every state transition is persisted to a backend, which means:

  • An agent can run for hours or days and survive process restarts
  • You can inspect the state of any in-flight agent at any time
  • Failed runs can be resumed from the last successful checkpoint, not from scratch

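LangGraph offers three checkpointer backends. MemorySaver ships with the core library; SqliteSaver and PostgresSaver are installed separately as langgraph-checkpoint-sqlite and langgraph-checkpoint-postgres: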

Backend        When to use
MemorySaver    Development and testing only; data disappears on restart
SqliteSaver    Single-server deployments, local persistence
PostgresSaver  Distributed systems, horizontal scaling

from langchain_core.messages import HumanMessage
from langgraph.checkpoint.postgres import PostgresSaver

DB_URI = "postgresql://user:pass@host:5432/langgraph"

# from_conn_string is a context manager that owns the database connection
with PostgresSaver.from_conn_string(DB_URI) as checkpointer:
    checkpointer.setup()  # creates the checkpoint tables; required on first use
    app = graph.compile(checkpointer=checkpointer)

    # Each run is keyed by thread_id: the same thread resumes the same conversation
    config = {"configurable": {"thread_id": "user-session-789"}}
    result = app.invoke({"messages": [HumanMessage("Summarize Q3 results")]}, config)

    # Resume after a failure: passing None continues from the last checkpoint
    app.invoke(None, config)

The thread ID is how LangGraph ties checkpoints to a specific conversation or task. Use a stable identifier — user ID, session ID, job ID — depending on your use case.

Human-in-the-Loop

LangGraph's interrupt mechanism pauses execution at a defined point and waits for external input before continuing. This is how you build approval flows, escalation paths, and review gates:

from langgraph.types import interrupt

def human_review_node(state: AgentState) -> dict:
    # Execution pauses here; the paused state is saved by the checkpointer
    decision = interrupt({
        "question": "Agent wants to delete production data. Approve?",
        "context": state["messages"][-3:],
    })
    # interrupt() returns the value supplied on resume.
    # "approved" must also be declared as a field on AgentState.
    return {"requires_review": False, "approved": decision == "yes"}

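The graph resumes when you call app.invoke again with the same thread_id, passing the human's response as Command(resume=...) (Command is imported from langgraph.types). That value becomes the return value of interrupt(). The node re-runs from its first line on resume, so keep any code before the interrupt call free of side effects that should not repeat. There is no polling loop and no timeout to manage: the state waits in the checkpointer until you pick it back up.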

When to Use LangGraph vs. Simpler Alternatives

Per the LangChain docs, not every agentic use case needs LangGraph's full machinery:

Scenario                         Recommendation
Single-turn Q&A with tools       create_react_agent from langgraph.prebuilt
Multi-step with basic retry      create_react_agent with recursion_limit
Long-running with persistence    LangGraph + PostgresSaver
Human approval required          LangGraph + interrupt
Multiple coordinating agents     LangGraph multi-agent supervisor pattern

Start with create_react_agent — it covers most agentic tasks with zero boilerplate. Reach for the graph API when you need explicit branching, persistence, or human gates.

Memory Across Sessions

LangGraph supports two memory scopes:

  • Short-term (in-thread): the message history within a single thread_id. Managed automatically by the checkpointer.
  • Long-term (cross-thread): facts that should persist across separate conversations — user preferences, past decisions, learned context. Stored externally (e.g., a vector store or key-value store) and loaded into state at the start of each run.

For most production agents, you need both. Short-term memory is free with checkpointing. Long-term memory requires designing a retrieval step at graph entry.
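A retrieval step at graph entry might look like the sketch below. It assumes a hypothetical in-process dictionary, LONG_TERM_STORE, standing in for whatever external store you use, plus invented field names (user_id, user_facts). The point is only the shape: a node that reads cross-thread facts and returns them as a partial state update.

```python
from typing import TypedDict

# Hypothetical long-term store keyed by user ID. In production this would be
# a database, key-value store, or vector store rather than a module-level dict.
LONG_TERM_STORE: dict[str, list[str]] = {
    "user-42": ["Prefers concise answers", "Works in the EU timezone"],
}


class MemoryState(TypedDict, total=False):
    user_id: str
    user_facts: list[str]
    messages: list


def load_memory(state: MemoryState) -> dict:
    """Entry node: fetch cross-thread facts for this user as a partial update."""
    facts = LONG_TERM_STORE.get(state["user_id"], [])
    return {"user_facts": list(facts)}


known = load_memory({"user_id": "user-42", "messages": []})
unknown = load_memory({"user_id": "user-999", "messages": []})
print(known)
print(unknown)
```

In a graph, this node would be registered with add_node and placed ahead of the agent node so the facts are in state before the first model call. LangGraph also provides its own store interface under langgraph.store, which may be a better fit than a hand-rolled lookup.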

The Bottom Line

LangGraph's value proposition is simple: it makes the control flow of your agent as explicit, inspectable, and testable as the rest of your application. The state is typed. The transitions are declared. The persistence is built in. That's what you need when agents run in production for real users on real tasks.

