Published on

Building Real-Time Telemetry Pipelines on AWS

Authors
  • avatar
    Name
    Benjamin Lee
    Twitter

Why Telemetry Pipelines Are Hard

Most telemetry pipelines look simple on a whiteboard: collect events, ship them somewhere, query them later. In production, the problems multiply fast. Events arrive out of order. Schemas drift across client versions. A single noisy service floods the queue and starves everything downstream.

This post walks through the architecture we used to build a reliable, low-latency telemetry pipeline on AWS — one that handled millions of events per day without compromising on freshness or accuracy.

The Architecture

ProducersAPI GatewayKinesis Data StreamsLambda (Enrichment)S3 + DynamoDB
                                               Kinesis FirehoseS3 (raw archive)

1. Ingestion: API Gateway + Kinesis

The entry point is a simple REST endpoint behind API Gateway that accepts batched event payloads and writes directly to a Kinesis Data Stream. We use PutRecords to batch up to 500 events per request, which dramatically reduces per-call overhead.

import boto3, json

kinesis = boto3.client('kinesis', region_name='us-east-1')

def put_events(events: list[dict], stream_name: str):
    records = [
        {
            'Data': json.dumps(event).encode(),
            'PartitionKey': event.get('device_id', 'default'),
        }
        for event in events
    ]
    return kinesis.put_records(Records=records, StreamName=stream_name)

Partition key selection matters: using device_id distributes load evenly across shards and keeps per-device events ordered within a shard.

2. Enrichment: Lambda Consumer

A Lambda function reads from the stream, enriches each event with metadata from DynamoDB (device registry, tenant info), and fans out to downstream destinations.

import boto3, json

dynamodb = boto3.resource('dynamodb')
registry = dynamodb.Table('DeviceRegistry')

def handler(event, context):
    enriched = []
    for record in event['Records']:
        payload = json.loads(record['kinesis']['data'])
        device = registry.get_item(Key={'device_id': payload['device_id']}).get('Item', {})
        enriched.append({**payload, 'tenant_id': device.get('tenant_id'), 'region': device.get('region')})
    # fan out to S3, DynamoDB, alerting...
    return {'statusCode': 200, 'processed': len(enriched)}

Key Lambda tuning decisions:

  • Batch size: 100 records. Larger batches improve throughput but increase cold-start cost per retry.
  • Bisect on error: enabled. Prevents one bad record from blocking an entire shard.
  • Parallelization factor: 2 per shard. Doubles throughput without over-provisioning.

3. Storage: S3 + DynamoDB

Hot queries (last 7 days, by device) land in DynamoDB with a TTL of 30 days. Cold/analytical queries go to S3 via Kinesis Firehose, partitioned by year/month/day/hour for Athena compatibility.

The Hard Parts

Schema evolution: We version every event type in the payload ("schema_version": "2") and maintain a schema registry in S3. The enrichment Lambda validates and normalizes before writing downstream. Unknown schema versions get routed to a dead-letter Firehose bucket for manual review.

Backpressure: When DynamoDB write capacity spikes, the Lambda throttles. We added SQS as a buffer between Lambda and DynamoDB writes — it absorbs bursts and smooths write rates automatically.

Monitoring: CloudWatch Metric Filters on Lambda logs emit custom metrics for schema_validation_failures and enrichment_errors. These feed a CloudWatch alarm that pages on-call when error rate exceeds 1%.

Costs at Scale

At 50M events/day across 10 shards:

ComponentMonthly Cost (est.)
Kinesis Data Streams (10 shards)~$110
Lambda (enrichment)~$40
DynamoDB (on-demand)~$90
S3 + Firehose (archive)~$25
Total~$265

The biggest lever is shard count. Right-size based on throughput (1 MB/s or 1,000 records/s per shard) and monitor GetRecords.IteratorAgeMilliseconds — if it climbs above a few seconds, add shards.

Takeaways

  • Use PutRecords batching at ingestion; it's a 10x cost win over individual puts.
  • Version your event schemas from day one — retrofitting is painful.
  • Bisect-on-error in Lambda is non-negotiable for production streams.
  • SQS as a write buffer is cheap insurance against downstream throttling.

The full Terraform module for this stack is something I'm planning to open-source. More on that soon.