Agentic AI on AWS: Architecture Patterns for Autonomous Systems
Author: Benjamin Lee
The Shift to Agentic Systems
Traditional AI workloads follow a simple path: user request → model → response. Agentic systems are fundamentally different. They plan, execute multi-step tasks, use external tools, and operate with meaningful autonomy over time. AWS's own architecture blog describes this as a shift from "user request → LLM → response" to "user goal → agent network → coordinated actions → outcome."
This changes what the infrastructure has to do. You're no longer just routing HTTP requests to a model endpoint. You're managing long-running stateful processes, tool execution, inter-agent communication, and failure recovery — all at cloud scale.
The Two Core Orchestration Patterns
AWS prescriptive guidance identifies two primary patterns for coordinating agents:
1. Synchronous Orchestration (Supervisor)
A supervisor agent actively directs the flow — it receives the goal, plans subtasks, delegates to specialized worker agents, and synthesizes results. Control is centralized.
User Goal
    ↓
Supervisor Agent (Bedrock / LangGraph)
    ├── Research Agent → web search, document retrieval
    ├── Analysis Agent → data processing, computation
    └── Writer Agent   → output generation
    ↓
Final Response
This pattern is easier to reason about and debug. The supervisor is the single source of truth for task state. The downside is that it's a bottleneck — the supervisor must be online for the duration of the task.
AWS implementation: Supervisor on ECS Fargate (long-running), workers on Lambda (short-lived, event-triggered). State in DynamoDB or S3.
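The control flow of the supervisor pattern can be sketched locally before any AWS wiring is involved. The following is a minimal sketch, not an AWS or LangGraph API: the `Supervisor` class, the fixed three-step plan, and the worker functions are all hypothetical stand-ins. In a real deployment the planning step would likely be a model call and each worker a Lambda invocation.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# A worker is any callable that takes a task description and returns text.
# Plain functions are used here so the control flow can run locally.
Worker = Callable[[str], str]


@dataclass
class Supervisor:
    workers: Dict[str, Worker]
    # The supervisor owns task state: every step and its result is recorded here.
    history: List[dict] = field(default_factory=list)

    def plan(self, goal: str) -> List[str]:
        # Hypothetical fixed plan; a real supervisor would ask a model to plan.
        return ["research", "analysis", "writer"]

    def run(self, goal: str) -> str:
        result = goal
        for name in self.plan(goal):
            # Delegate to the named worker, feeding it the previous step's output.
            result = self.workers[name](result)
            self.history.append({"worker": name, "output": result})
        return result


if __name__ == "__main__":
    supervisor = Supervisor(
        workers={
            "research": lambda task: f"notes({task})",
            "analysis": lambda task: f"findings({task})",
            "writer": lambda task: f"report({task})",
        }
    )
    print(supervisor.run("compare queue services"))
    # prints: report(findings(notes(compare queue services)))
```

Because `history` lives on the supervisor, a failed run can be inspected in one place, which is the debugging advantage described above. It is also why the supervisor has to stay alive for the whole task.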
2. Asynchronous Choreography (Event-Driven)
Agents operate autonomously, triggered by events. No central coordinator — each agent reacts to messages on a queue or event bus and publishes its outputs for downstream agents to consume.
User Goal → SQS → Research Agent
                      ↓ (publishes results)
                  EventBridge → Analysis Agent
                                    ↓
                                SQS → Writer Agent → Output
This pattern scales better and is more resilient — a failed agent can retry independently without affecting others. The tradeoff is that the overall workflow is harder to observe and debug. You need distributed tracing to reconstruct what happened.
AWS implementation: SQS for queuing, EventBridge for routing, Lambda for stateless agents, Step Functions for workflow visibility.
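To make the choreography idea concrete, here is a minimal in-memory sketch. `LocalBus` is a hypothetical stand-in for SQS and EventBridge, not an AWS API, and the topic names and agent functions are invented for illustration. It shows the two properties discussed above: agents only know about topics, never about each other, and a failing agent is retried on its own message without involving the rest of the pipeline.

```python
from collections import deque
from typing import Callable, Deque, Dict, List, Tuple

Event = Tuple[str, str]  # (topic, payload)
Handler = Callable[[str], List[Event]]  # returns events to publish downstream


class LocalBus:
    """In-memory stand-in for a queue or event bus (not an AWS API)."""

    def __init__(self, max_attempts: int = 3) -> None:
        self.handlers: Dict[str, Handler] = {}
        self.queue: Deque[Tuple[str, str, int]] = deque()
        self.dead_letters: List[Event] = []
        self.max_attempts = max_attempts

    def subscribe(self, topic: str, handler: Handler) -> None:
        self.handlers[topic] = handler

    def publish(self, topic: str, payload: str) -> None:
        self.queue.append((topic, payload, 1))

    def run(self) -> None:
        while self.queue:
            topic, payload, attempt = self.queue.popleft()
            handler = self.handlers.get(topic)
            if handler is None:
                continue  # no agent listens on this topic
            try:
                for out_topic, out_payload in handler(payload):
                    self.publish(out_topic, out_payload)
            except Exception:
                if attempt < self.max_attempts:
                    # Retry only this message; other agents are unaffected.
                    self.queue.append((topic, payload, attempt + 1))
                else:
                    self.dead_letters.append((topic, payload))


outputs: List[str] = []


def research_agent(goal: str) -> List[Event]:
    return [("research.done", f"notes({goal})")]


def analysis_agent(notes: str) -> List[Event]:
    return [("analysis.done", f"findings({notes})")]


def writer_agent(findings: str) -> List[Event]:
    outputs.append(f"report({findings})")
    return []


if __name__ == "__main__":
    bus = LocalBus()
    bus.subscribe("goal.submitted", research_agent)
    bus.subscribe("research.done", analysis_agent)
    bus.subscribe("analysis.done", writer_agent)
    bus.publish("goal.submitted", "compare queue services")
    bus.run()
    print(outputs)
```

Nothing in this sketch records the overall path a goal took through the agents, which mirrors the observability gap noted above.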
State Management at Scale
One practical challenge in cloud-deployed agents is that Lambda functions are effectively stateless: an execution environment may be reused, but nothing held in memory is guaranteed to survive between invocations, so conversation context can't live there. The solution, per AWS guidance, is to externalize session state to persistent storage and reconstruct it at the start of each invocation.
import json

import boto3

s3 = boto3.client('s3')
STATE_BUCKET = "my-agent-state"


def load_state(session_id: str) -> dict:
    """Fetch the session's state from S3, or start fresh if none exists."""
    try:
        obj = s3.get_object(Bucket=STATE_BUCKET, Key=f"sessions/{session_id}.json")
        return json.loads(obj['Body'].read())
    except s3.exceptions.NoSuchKey:
        # First invocation for this session: no state object yet.
        return {"messages": [], "steps": 0}


def save_state(session_id: str, state: dict) -> None:
    """Write the full session state back to S3, overwriting the previous copy."""
    s3.put_object(
        Bucket=STATE_BUCKET,
        Key=f"sessions/{session_id}.json",
        Body=json.dumps(state),
    )


def handler(event, context):
    session_id = event["session_id"]
    state = load_state(session_id)
    # ... run agent step ...
    save_state(session_id, state)
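This keeps the Lambda itself stateless for horizontal scaling while delivering a stateful experience to the user. DynamoDB typically serves single-item reads in single-digit milliseconds, which is faster than fetching an S3 object, so it is worth using for hot session data. DynamoDB items are capped at 400 KB, though, so long transcripts may still belong in S3.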
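One caveat with a plain load-then-save cycle is that two concurrent invocations for the same session can both load the same state, and whichever saves last silently overwrites the other. A common mitigation is optimistic locking with a version counter. The sketch below illustrates the idea against an in-memory stand-in; `InMemoryStore`, `load_versioned`, `save_versioned`, and the `version` field are all hypothetical names. With DynamoDB, the compare-and-set step would map to a conditional write (a `ConditionExpression` on the put), which is not shown here.

```python
import copy
from typing import Dict, Optional


class VersionConflict(Exception):
    """Raised when the stored version differs from the one the caller loaded."""


class InMemoryStore:
    """Local stand-in for a table that supports conditional writes."""

    def __init__(self) -> None:
        self._items: Dict[str, dict] = {}

    def get(self, session_id: str) -> Optional[dict]:
        item = self._items.get(session_id)
        return copy.deepcopy(item) if item is not None else None

    def put_if_version(self, session_id: str, item: dict, expected: int) -> None:
        # Compare-and-set: only write if nobody else has written since we loaded.
        current = self._items.get(session_id, {}).get("version", 0)
        if current != expected:
            raise VersionConflict(f"expected v{expected}, found v{current}")
        self._items[session_id] = copy.deepcopy(item)


def load_versioned(store: InMemoryStore, session_id: str) -> dict:
    return store.get(session_id) or {"messages": [], "steps": 0, "version": 0}


def save_versioned(store: InMemoryStore, session_id: str, state: dict) -> dict:
    expected = state["version"]
    new_state = {**state, "version": expected + 1}
    store.put_if_version(session_id, new_state, expected)
    return new_state


if __name__ == "__main__":
    store = InMemoryStore()
    first = load_versioned(store, "abc")
    second = load_versioned(store, "abc")  # a concurrent invocation
    first["steps"] += 1
    save_versioned(store, "abc", first)  # succeeds, stored version becomes 1
    try:
        save_versioned(store, "abc", second)  # stale copy, still at version 0
    except VersionConflict as exc:
        print("conflict:", exc)
```

On a conflict, the losing invocation would typically reload the state and reapply its step rather than fail the request.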
Amazon Bedrock AgentCore
For teams that want managed infrastructure rather than rolling their own, Amazon Bedrock AgentCore provides a purpose-built serverless runtime for deploying agents, with built-in identity, observability, and tool execution. It handles the undifferentiated heavy lifting: session management, tool routing, and IAM scoping per agent action.
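AgentCore resources can be provisioned with CloudFormation. Treat the template below as schematic: confirm the resource type, property names, and model ID against the current CloudFormation and Bedrock references before deploying.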
Resources:
  MyAgent:
    Type: AWS::Bedrock::AgentCoreAgent
    Properties:
      AgentName: research-agent
      FoundationModel: anthropic.claude-sonnet-4-6-v1
      InstructionConfiguration:
        Instruction: "You are a research agent. Use tools to answer questions accurately."
      MemoryConfiguration:
        EnabledMemoryTypes: [SESSION]
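The tradeoff against self-hosted LangGraph on ECS is control. AgentCore is faster to set up and fully managed, but you build within its runtime contract, tool interface, and service quotas. Self-hosting LangGraph gives you full control over the graph, the deployment environment, and custom checkpointing backends, at the cost of operating all of it yourself. AgentCore is documented as framework- and model-agnostic, so the choice is less about which models you can call and more about how much of the platform you want to own.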
Observability for Agentic Systems
Standard application monitoring doesn't map well to agents. A single user request might involve 20 LLM calls, 8 tool executions, and 3 retry loops — all invisible in a traditional APM trace.
The tools that actually work:
- AWS X-Ray: trace requests across Lambda, ECS, and Bedrock invocations with a single trace ID threaded through all hops
- CloudWatch Logs Insights: query structured logs from agent steps to reconstruct execution paths
- LangSmith: LangChain's dedicated tracing platform — captures every LLM call, tool invocation, and state transition in a browsable UI
For production, instrument every agent node with a trace ID and log the input state, output state, and latency. When something goes wrong — and it will — you need to be able to replay the exact sequence of steps that led to the failure.
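As a rough illustration of that instrumentation, the decorator below wraps an agent node and emits one structured JSON log record per call, carrying the trace ID, input state, output state, and latency. This is a sketch using only the standard library. It does not call the X-Ray or LangSmith SDKs, and the `traced` decorator, the `trace_id` state key, and the `research` node are hypothetical names chosen for the example.

```python
import functools
import json
import logging
import time
import uuid
from typing import Callable

logger = logging.getLogger("agent.trace")


def traced(node_name: str) -> Callable:
    """Wrap an agent node so each call emits one structured log record."""

    def decorator(fn: Callable[[dict], dict]) -> Callable[[dict], dict]:
        @functools.wraps(fn)
        def wrapper(state: dict) -> dict:
            # Reuse the caller's trace ID if present so all hops share it.
            trace_id = state.get("trace_id") or uuid.uuid4().hex
            state = {**state, "trace_id": trace_id}
            record = {"trace_id": trace_id, "node": node_name, "input": state}
            start = time.perf_counter()
            try:
                # Make sure the trace ID survives into the output state.
                output = {**fn(state), "trace_id": trace_id}
                record["output"] = output
                return output
            except Exception as exc:
                record["error"] = repr(exc)
                raise
            finally:
                elapsed_ms = (time.perf_counter() - start) * 1000
                record["latency_ms"] = round(elapsed_ms, 3)
                logger.info(json.dumps(record, default=str))

        return wrapper

    return decorator


@traced("research")
def research(state: dict) -> dict:
    return {**state, "notes": f"notes about {state['goal']}"}


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO, format="%(message)s")
    research({"goal": "queue services"})
```

Because every record is a single JSON line keyed by `trace_id`, a log query tool can filter on that field and sort by timestamp to reconstruct the sequence of steps for one request.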
Practical Recommendations
If you're starting an agentic project on AWS today:
- Use LangGraph for the agent logic — explicit state, built-in checkpointing, human-in-the-loop support
- Deploy on ECS Fargate for long-running agents; Lambda for short-lived tool execution workers
- Store session state in DynamoDB (hot) and S3 (cold archive)
- Use SQS + EventBridge to decouple agents in async workflows
- Wire up X-Ray and LangSmith before you ship anything to production
The agentic AI infrastructure space is moving fast. The patterns above are stable enough to build on — the services and framework versions are what will keep changing.
Sources:
- Architecting for Agentic AI Development on AWS — AWS Architecture Blog
- Agentic AI Patterns and Workflows on AWS — AWS Prescriptive Guidance
- Secure AI Agents with Amazon Bedrock AgentCore — AWS ML Blog
- Effectively Building AI Agents on AWS Serverless — AWS Compute Blog