Agentic AI on AWS: Architecture Patterns for Autonomous Systems

Author: Benjamin Lee

The Shift to Agentic Systems

Traditional AI workloads follow a simple path: user request → model → response. Agentic systems are fundamentally different. They plan, execute multi-step tasks, use external tools, and operate with meaningful autonomy over time. AWS's own architecture blog describes this as a shift from "user request → LLM → response" to "user goal → agent network → coordinated actions → outcome."

This changes what the infrastructure has to do. You're no longer just routing HTTP requests to a model endpoint. You're managing long-running stateful processes, tool execution, inter-agent communication, and failure recovery — all at cloud scale.

The Two Core Orchestration Patterns

AWS prescriptive guidance identifies two primary patterns for coordinating agents:

1. Synchronous Orchestration (Supervisor)

A supervisor agent actively directs the flow — it receives the goal, plans subtasks, delegates to specialized worker agents, and synthesizes results. Control is centralized.

User Goal
    ↓
Supervisor Agent (Bedrock / LangGraph)
    ├── Research Agent → web search, document retrieval
    ├── Analysis Agent → data processing, computation
    └── Writer Agent   → output generation
    ↓
Final Response

This pattern is easier to reason about and debug. The supervisor is the single source of truth for task state. The downside is that it's a bottleneck — the supervisor must be online for the duration of the task.

AWS implementation: Supervisor on ECS Fargate (long-running), workers on Lambda (short-lived, event-triggered). State in DynamoDB or S3.
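
As a rough illustration of the supervisor pattern, here is a minimal in-process sketch. The `Supervisor` class, the fixed three-step plan, and the worker functions are hypothetical stand-ins: a real supervisor would likely ask an LLM to produce the plan and would invoke workers over the network (for example, as Lambda functions) rather than as local callables.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# A worker receives the goal plus all results so far and returns text.
Worker = Callable[[str, Dict[str, str]], str]


@dataclass
class Supervisor:
    """Central coordinator: owns the plan and all task state."""
    workers: Dict[str, Worker]
    results: Dict[str, str] = field(default_factory=dict)

    def plan(self, goal: str) -> List[str]:
        # Hypothetical fixed plan; a real supervisor would derive this per goal.
        return ["research", "analysis", "writer"]

    def run(self, goal: str) -> str:
        for step in self.plan(goal):
            # Delegate to the worker, then record its result centrally.
            self.results[step] = self.workers[step](goal, self.results)
        return self.results["writer"]


def research(goal: str, results: Dict[str, str]) -> str:
    return f"notes on {goal}"


def analysis(goal: str, results: Dict[str, str]) -> str:
    return f"analysis of [{results['research']}]"


def writer(goal: str, results: Dict[str, str]) -> str:
    return f"report: {results['analysis']}"


supervisor = Supervisor(
    workers={"research": research, "analysis": analysis, "writer": writer}
)
final = supervisor.run("S3 pricing")
```

Because every result passes through `supervisor.results`, the full task state can be inspected in one place, which is the debugging advantage described above. It also shows the cost: nothing progresses unless `run` is still executing.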

2. Asynchronous Choreography (Event-Driven)

Agents operate autonomously, triggered by events. No central coordinator — each agent reacts to messages on a queue or event bus and publishes its outputs for downstream agents to consume.

User Goal → SQS → Research Agent
                      ↓ (publishes results)
                 EventBridge → Analysis Agent
                                   ↓
                                  SQS → Writer Agent → Output

This pattern scales better and is more resilient — a failed agent can retry independently without affecting others. The tradeoff is that the overall workflow is harder to observe and debug. You need distributed tracing to reconstruct what happened.

AWS implementation: SQS for queuing, EventBridge for routing, Lambda for stateless agents, Step Functions for workflow visibility.
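
To make the choreography idea concrete, the sketch below wraps an agent as a Lambda handler for an SQS event source. `make_handler`, `research_agent`, and the `research.completed` event name are hypothetical. The `publish` callable is injected so the example runs without AWS access; in a deployment it would presumably wrap an EventBridge `put_events` call. The `batchItemFailures` return shape is Lambda's partial batch response format, which only takes effect when `ReportBatchItemFailures` is enabled on the event source mapping.

```python
import json
from typing import Callable, Dict, List

# Publisher signature: (event_type, payload) -> None.
Publisher = Callable[[str, dict], None]


def research_agent(payload: dict) -> dict:
    # Placeholder for the real agent step (LLM call, retrieval, and so on).
    return {"goal": payload["goal"], "notes": f"notes on {payload['goal']}"}


def make_handler(agent: Callable[[dict], dict], output_event: str, publish: Publisher):
    """Wrap an agent function as a Lambda handler for SQS-delivered messages."""

    def handler(event: dict, context=None) -> dict:
        failures: List[Dict[str, str]] = []
        for record in event.get("Records", []):
            try:
                result = agent(json.loads(record["body"]))
                # Emit the result for whichever downstream agent subscribes.
                publish(output_event, result)
            except Exception:
                # Report only this message as failed so it is retried alone.
                failures.append({"itemIdentifier": record["messageId"]})
        return {"batchItemFailures": failures}

    return handler


# Usage with an in-memory publisher standing in for the event bus.
published = []
handler = make_handler(
    research_agent,
    "research.completed",
    lambda event_type, payload: published.append((event_type, payload)),
)
response = handler({"Records": [
    {"messageId": "m-1", "body": json.dumps({"goal": "S3 pricing"})},
    {"messageId": "m-2", "body": "not json"},
]})
```

The malformed second message fails on its own while the first is processed and published, which is the independent-retry property that makes this pattern resilient.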

State Management at Scale

One of the practical challenges in cloud-deployed agents is that Lambda functions are stateless — they can't hold conversation context between invocations. The solution, per AWS guidance, is to externalize session state to persistent storage and reconstruct it at the start of each invocation.

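import json

import boto3

s3 = boto3.client("s3")
STATE_BUCKET = "my-agent-state"


def state_key(session_id: str) -> str:
    return f"sessions/{session_id}.json"


def load_state(session_id: str) -> dict:
    """Return the saved session state, or a fresh state for a new session."""
    try:
        obj = s3.get_object(Bucket=STATE_BUCKET, Key=state_key(session_id))
        return json.loads(obj["Body"].read())
    except s3.exceptions.NoSuchKey:
        # Note: without s3:ListBucket permission, a missing key surfaces as
        # AccessDenied rather than NoSuchKey.
        return {"messages": [], "steps": 0}


def save_state(session_id: str, state: dict) -> None:
    """Overwrite the stored state for this session."""
    s3.put_object(
        Bucket=STATE_BUCKET,
        Key=state_key(session_id),
        Body=json.dumps(state),
        ContentType="application/json",
    )


def handler(event, context):
    session_id = event["session_id"]
    state = load_state(session_id)  # reconstruct context from storage
    # ... run agent step ...
    save_state(session_id, state)  # persist for the next invocation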

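This keeps the Lambda itself stateless for horizontal scaling while delivering a stateful experience to the user. Note that the load-then-save sequence is last-writer-wins: two concurrent invocations for the same session can overwrite each other's changes. For hot session data that is read on every step, DynamoDB is the better fit, since it typically serves reads in single-digit milliseconds, noticeably faster than an S3 GET.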
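
For hot session data, the same load/save pair can be pointed at DynamoDB. The following is a sketch under assumptions: the table name `agent-sessions` and the `session_id` partition key are hypothetical, and the state is stored as a single JSON string attribute to sidestep DynamoDB's `Decimal` number handling. The `table` argument is expected to behave like a boto3 DynamoDB `Table` resource (`get_item` / `put_item`).

```python
import copy
import json
from typing import Any

DEFAULT_STATE = {"messages": [], "steps": 0}


def load_state(table: Any, session_id: str) -> dict:
    """Fetch session state; fall back to a fresh copy for new sessions."""
    response = table.get_item(Key={"session_id": session_id})
    item = response.get("Item")
    if item is None:
        # Deep copy so callers never mutate the shared default.
        return copy.deepcopy(DEFAULT_STATE)
    return json.loads(item["state"])


def save_state(table: Any, session_id: str, state: dict) -> None:
    """Write the whole state back as one JSON string attribute."""
    table.put_item(Item={"session_id": session_id, "state": json.dumps(state)})


def get_table(name: str = "agent-sessions"):
    # Imported lazily so the helpers above can be exercised without AWS access.
    import boto3

    return boto3.resource("dynamodb").Table(name)
```

In a Lambda handler, `get_table()` would typically be called once at module load and the result passed to both helpers. Storing the state as one string keeps the code simple, at the cost of not being able to query individual fields.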

Amazon Bedrock AgentCore

For teams that want managed infrastructure rather than rolling their own, Amazon Bedrock AgentCore provides a purpose-built, managed runtime for deploying agents, with built-in identity, observability, and tool execution, so there is no ECS cluster of your own to operate. It handles the undifferentiated heavy lifting: session management, tool routing, and IAM scoping per agent action.

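AgentCore resources can be provisioned with CloudFormation. The template below sketches the general shape only; confirm the resource type and property names against the current CloudFormation resource reference before deploying: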

Resources:
  MyAgent:
    Type: AWS::Bedrock::AgentCoreAgent
    Properties:
      AgentName: research-agent
      FoundationModel: anthropic.claude-sonnet-4-6-v1
      InstructionConfiguration:
        Instruction: "You are a research agent. Use tools to answer questions accurately."
      MemoryConfiguration:
        EnabledMemoryTypes: [SESSION]

The tradeoff vs. LangGraph-on-ECS is flexibility. AgentCore is faster to set up and fully managed, but you work within its runtime contract, tool interface, and service limits. Running LangGraph on your own ECS tasks gives you full control over the graph, the hosting environment, the choice of model provider, and custom checkpointing backends.

Observability for Agentic Systems

Standard application monitoring doesn't map well to agents. A single user request might involve 20 LLM calls, 8 tool executions, and 3 retry loops — all invisible in a traditional APM trace.

The tools that actually work:

  • AWS X-Ray: trace requests across Lambda, ECS, Bedrock invocations with a single trace ID threaded through all hops
  • CloudWatch Logs Insights: query structured logs from agent steps to reconstruct execution paths
  • LangSmith: LangChain's dedicated tracing platform — captures every LLM call, tool invocation, and state transition in a browsable UI

For production, instrument every agent node with a trace ID and log the input state, output state, and latency. When something goes wrong — and it will — you need to be able to replay the exact sequence of steps that led to the failure.
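
One way to apply this is a small decorator around each agent node. The sketch below is illustrative rather than a prescribed approach: the `traced` decorator, the `agent.trace` logger name, and the convention of carrying `trace_id` inside the state dict are all assumptions. It emits one JSON log line per node call, which is the kind of structured record CloudWatch Logs Insights can query.

```python
import functools
import json
import logging
import time
import uuid

logger = logging.getLogger("agent.trace")


def traced(node_name: str):
    """Log input state, output state, and latency for one agent node."""

    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(state: dict) -> dict:
            # Reuse the trace ID if an upstream node already set one.
            trace_id = state.get("trace_id") or str(uuid.uuid4())
            state = {**state, "trace_id": trace_id}
            start = time.perf_counter()
            output, error = None, None
            try:
                output = {**fn(state), "trace_id": trace_id}
                return output
            except Exception as exc:
                error = repr(exc)
                raise
            finally:
                # Runs on success and on failure, so failed steps are logged too.
                logger.info(json.dumps({
                    "trace_id": trace_id,
                    "node": node_name,
                    "input": state,
                    "output": output,
                    "error": error,
                    "latency_ms": round((time.perf_counter() - start) * 1000, 2),
                }, default=str))

        return wrapper

    return decorator


@traced("research")
def research_node(state: dict) -> dict:
    return {**state, "notes": f"notes on {state['goal']}"}


first = research_node({"goal": "S3 pricing"})
second = research_node(first)  # same trace_id is threaded through
```

Since each record contains the full input state, a failing step can in principle be replayed by feeding the logged input back into the same node. For large states, logging a reference to stored state instead of the state itself may be preferable.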

Practical Recommendations

If you're starting an agentic project on AWS today:

  1. Use LangGraph for the agent logic — explicit state, built-in checkpointing, human-in-the-loop support
  2. Deploy on ECS Fargate for long-running agents; Lambda for short-lived tool execution workers
  3. Store session state in DynamoDB (hot) and S3 (cold archive)
  4. Use SQS + EventBridge to decouple agents in async workflows
  5. Wire up X-Ray and LangSmith before you ship anything to production

The agentic AI infrastructure space is moving fast. The patterns above are stable enough to build on — the services and framework versions are what will keep changing.

