- Published on
Building Real-Time Telemetry Pipelines on AWS
- Authors

- Name
- Benjamin Lee
Why Telemetry Pipelines Are Hard
Most telemetry pipelines look simple on a whiteboard: collect events, ship them somewhere, query them later. In production, the problems multiply fast. Events arrive out of order. Schemas drift across client versions. A single noisy service floods the queue and starves everything downstream.
This post walks through the architecture we used to build a reliable, low-latency telemetry pipeline on AWS — one that handled millions of events per day without compromising on freshness or accuracy.
The Architecture
Producers → API Gateway → Kinesis Data Streams → Lambda (Enrichment) → S3 + DynamoDB
↓
Kinesis Firehose → S3 (raw archive)
1. Ingestion: API Gateway + Kinesis
The entry point is a simple REST endpoint behind API Gateway that accepts batched event payloads and writes directly to a Kinesis Data Stream. We use PutRecords to batch up to 500 events per request, which dramatically reduces per-call overhead.
import boto3, json
kinesis = boto3.client('kinesis', region_name='us-east-1')
def put_events(events: list[dict], stream_name: str):
records = [
{
'Data': json.dumps(event).encode(),
'PartitionKey': event.get('device_id', 'default'),
}
for event in events
]
return kinesis.put_records(Records=records, StreamName=stream_name)
Partition key selection matters: using device_id distributes load evenly across shards and keeps per-device events ordered within a shard.
2. Enrichment: Lambda Consumer
A Lambda function reads from the stream, enriches each event with metadata from DynamoDB (device registry, tenant info), and fans out to downstream destinations.
import boto3, json
dynamodb = boto3.resource('dynamodb')
registry = dynamodb.Table('DeviceRegistry')
def handler(event, context):
enriched = []
for record in event['Records']:
payload = json.loads(record['kinesis']['data'])
device = registry.get_item(Key={'device_id': payload['device_id']}).get('Item', {})
enriched.append({**payload, 'tenant_id': device.get('tenant_id'), 'region': device.get('region')})
# fan out to S3, DynamoDB, alerting...
return {'statusCode': 200, 'processed': len(enriched)}
Key Lambda tuning decisions:
- Batch size: 100 records. Larger batches improve throughput but increase cold-start cost per retry.
- Bisect on error: enabled. Prevents one bad record from blocking an entire shard.
- Parallelization factor: 2 per shard. Doubles throughput without over-provisioning.
3. Storage: S3 + DynamoDB
Hot queries (last 7 days, by device) land in DynamoDB with a TTL of 30 days. Cold/analytical queries go to S3 via Kinesis Firehose, partitioned by year/month/day/hour for Athena compatibility.
The Hard Parts
Schema evolution: We version every event type in the payload ("schema_version": "2") and maintain a schema registry in S3. The enrichment Lambda validates and normalizes before writing downstream. Unknown schema versions get routed to a dead-letter Firehose bucket for manual review.
Backpressure: When DynamoDB write capacity spikes, the Lambda throttles. We added SQS as a buffer between Lambda and DynamoDB writes — it absorbs bursts and smooths write rates automatically.
Monitoring: CloudWatch Metric Filters on Lambda logs emit custom metrics for schema_validation_failures and enrichment_errors. These feed a CloudWatch alarm that pages on-call when error rate exceeds 1%.
Costs at Scale
At 50M events/day across 10 shards:
| Component | Monthly Cost (est.) |
|---|---|
| Kinesis Data Streams (10 shards) | ~$110 |
| Lambda (enrichment) | ~$40 |
| DynamoDB (on-demand) | ~$90 |
| S3 + Firehose (archive) | ~$25 |
| Total | ~$265 |
The biggest lever is shard count. Right-size based on throughput (1 MB/s or 1,000 records/s per shard) and monitor GetRecords.IteratorAgeMilliseconds — if it climbs above a few seconds, add shards.
Takeaways
- Use
PutRecordsbatching at ingestion; it's a 10x cost win over individual puts. - Version your event schemas from day one — retrofitting is painful.
- Bisect-on-error in Lambda is non-negotiable for production streams.
- SQS as a write buffer is cheap insurance against downstream throttling.
The full Terraform module for this stack is something I'm planning to open-source. More on that soon.