AWS Messaging & Event-Driven Architecture


Decoupled, asynchronous communication between services — SQS queues, SNS fan-out, Kinesis streaming, EventBridge routing, and how these primitives compose into event-driven architectures.


Overview

In a tightly coupled architecture, service A calls service B directly and synchronously. If B is slow, A waits. If B is down, A fails. If B cannot keep up with A’s request rate, A backs off or drops requests. Every dependent service becomes a potential failure propagation path.

Messaging services break these dependencies. A producer writes a message and moves on — it does not wait for, or even know about, the consumer. A consumer reads messages at its own pace — it does not need to be running when the producer sends. The two components operate independently: they can be scaled separately, deployed independently, fail without cascading, and be replaced without the other side noticing.

AWS offers a suite of messaging primitives that cover different patterns: SQS for point-to-point queuing, SNS for fan-out pub/sub, Kinesis for high-throughput real-time streaming, and EventBridge for event routing across AWS services, custom applications, and SaaS partners. These are not competing services — they compose. Understanding which primitive fits which problem is the foundation of event-driven architecture on AWS.


Amazon SQS — Simple Queue Service

SQS is a fully managed message queue. A producer sends messages to a queue. One or more consumers poll the queue, retrieve messages, process them, and delete them. SQS stores messages durably across multiple AZs until they are explicitly deleted.
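
The full produce-consume-delete cycle is a few API calls. A minimal boto3 sketch (the queue URL is a placeholder, and process is a hypothetical handler):

```python
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/orders"  # placeholder

# Producer: send a message and move on.
sqs.send_message(QueueUrl=QUEUE_URL, MessageBody='{"order_id": "42"}')

# Consumer: poll, process, then explicitly delete.
resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10)
for msg in resp.get("Messages", []):
    process(msg["Body"])  # hypothetical handler
    sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```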

Standard Queue

The Standard Queue is the default SQS type:

- At-least-once delivery: a message is delivered at least once, but occasionally more than once, so consumers should be idempotent.
- Best-effort ordering: messages may arrive in a different order than they were sent.
- Nearly unlimited throughput: no cap on transactions per second.

FIFO Queue

FIFO queues provide stronger guarantees at the cost of lower maximum throughput:

- Exactly-once processing: duplicates are eliminated within a 5-minute deduplication window, via content-based deduplication or an explicit MessageDeduplicationId.
- Strict ordering: messages sharing a MessageGroupId are delivered in the exact order they were sent.
- Throughput cap: 300 messages/second without batching, 3,000 messages/second with batching.

Visibility Timeout

When a consumer reads a message with ReceiveMessage, the message is not deleted. It becomes invisible to all other consumers for the visibility timeout period. The consuming instance is expected to process the message and call DeleteMessage before the timeout expires.

If the consumer crashes, stalls, or fails to call DeleteMessage, the visibility timeout expires and the message becomes visible again — another consumer can pick it up. This prevents message loss when a consumer fails mid-processing.

Set the visibility timeout to be longer than your maximum expected processing time. If processing can take 30 seconds, set the timeout to 60 seconds. A consumer can call ChangeMessageVisibility to extend the timeout dynamically if processing takes longer than anticipated.
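
A sketch of that heartbeat pattern, assuming a hypothetical long-running process_slowly handler: the consumer pushes the timeout out while work is in progress, then deletes the message.

```python
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/jobs"  # placeholder

resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=1)
for msg in resp.get("Messages", []):
    handle = msg["ReceiptHandle"]
    # Processing is running long: extend invisibility by another 120 seconds
    # so no other consumer picks the message up mid-flight.
    sqs.change_message_visibility(
        QueueUrl=QUEUE_URL, ReceiptHandle=handle, VisibilityTimeout=120
    )
    process_slowly(msg["Body"])  # hypothetical long-running handler
    sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=handle)
```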

Dead-Letter Queue (DLQ)

A DLQ is a separate SQS queue that receives messages that fail processing repeatedly. Configure a maxReceiveCount on the source queue — if a message is received more than this many times without being deleted, SQS moves it to the DLQ automatically.

| Purpose | Detail |
| --- | --- |
| Debugging | Inspect failed messages without losing them. Examine why processing failed. |
| Alerting | Set a CloudWatch alarm on the DLQ ApproximateNumberOfMessagesVisible metric. |
| Reprocessing | After fixing the bug, replay messages from the DLQ back to the source queue via the console or CLI. |
| Isolation | Failed messages do not block other messages from being processed. |

DLQs apply to both Standard and FIFO queues. A FIFO DLQ must be used with a FIFO source queue.
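
Wiring a DLQ is a single attribute on the source queue. A boto3 sketch (the ARNs and the count of 5 are illustrative):

```python
import json
import boto3

sqs = boto3.client("sqs")

# Attach the DLQ via a redrive policy on the *source* queue.
sqs.set_queue_attributes(
    QueueUrl="https://sqs.us-east-1.amazonaws.com/123456789012/orders",  # placeholder
    Attributes={
        "RedrivePolicy": json.dumps({
            "deadLetterTargetArn": "arn:aws:sqs:us-east-1:123456789012:orders-dlq",
            "maxReceiveCount": "5",  # move to DLQ after the 5th failed receive
        })
    },
)
```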

Long Polling

By default, ReceiveMessage returns immediately — even if the queue is empty. This short polling generates many empty API responses, increasing cost and CPU overhead on the consumer.

Long polling instructs SQS to wait up to 20 seconds (WaitTimeSeconds) for a message to arrive before returning. If a message arrives during the wait, it is returned immediately. If the timeout expires with no message, an empty response is returned. Long polling reduces empty responses by 90%+ and lowers SQS cost. Enable it on the queue via ReceiveMessageWaitTimeSeconds or per individual request.
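
Enabling it per request is one parameter, as in this sketch (queue URL is a placeholder):

```python
import boto3

sqs = boto3.client("sqs")

# Wait up to 20 seconds for a message instead of returning immediately.
resp = sqs.receive_message(
    QueueUrl="https://sqs.us-east-1.amazonaws.com/123456789012/orders",  # placeholder
    MaxNumberOfMessages=10,
    WaitTimeSeconds=20,  # long polling; 0 means short polling
)
```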

Key Limits

| Attribute | Standard Queue | FIFO Queue |
| --- | --- | --- |
| Message retention | 4 days default, up to 14 days | Same |
| Max message size | 256 KB | Same |
| Max long poll wait | 20 seconds | Same |
| Max visibility timeout | 12 hours | Same |
| Inflight messages | 120,000 | 20,000 |

For payloads larger than 256 KB, use the SQS Extended Client Library (Java/Python), which stores the message body in S3 and puts only the S3 reference in the SQS message. The consumer retrieves the reference, fetches from S3, and processes the full payload.
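
If the extended client is not available for your language, the same claim-check pattern is straightforward to sketch by hand with boto3 (bucket name and queue URL are placeholders):

```python
import json
import uuid
import boto3

s3 = boto3.client("s3")
sqs = boto3.client("sqs")
BUCKET = "my-large-payloads"  # placeholder
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/orders"  # placeholder

def send_large(payload: bytes) -> None:
    # Claim check: the body goes to S3, only the pointer rides in the queue.
    key = f"payloads/{uuid.uuid4()}"
    s3.put_object(Bucket=BUCKET, Key=key, Body=payload)
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps({"s3_bucket": BUCKET, "s3_key": key}),
    )

def receive_large():
    resp = sqs.receive_message(QueueUrl=QUEUE_URL, WaitTimeSeconds=20)
    for msg in resp.get("Messages", []):
        ref = json.loads(msg["Body"])
        data = s3.get_object(Bucket=ref["s3_bucket"], Key=ref["s3_key"])["Body"].read()
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
        return data
    return None
```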


Amazon SNS — Simple Notification Service

SNS is a fully managed pub/sub messaging service. A publisher sends a message to an SNS topic. SNS immediately delivers the message to all subscribed endpoints simultaneously. The publisher has no knowledge of who the subscribers are or how many there are.

Subscription Types

| Subscription Protocol | Use Case |
| --- | --- |
| SQS | Fan-out to queue for async processing. Queue absorbs traffic spikes. |
| Lambda | Invoke Lambda function directly. |
| HTTP/HTTPS | POST message to a web endpoint. Requires subscription confirmation. |
| Email / Email-JSON | Human notification. Email-JSON delivers raw JSON payload. |
| SMS | Text message to a phone number via SNS mobile. |
| Kinesis Data Firehose | Route events to S3, Redshift, or OpenSearch via Firehose. |
| Mobile Push | Apple Push Notification Service (APNS), Firebase Cloud Messaging (FCM/GCM). |

Message Filtering

Without filtering, every subscriber receives every message published to the topic. Subscription filter policies allow each subscriber to declare which messages it wants. A filter policy is a JSON object specifying attribute conditions.

Example: An e-commerce order topic receives events of type order_placed, payment_failed, and order_shipped. The fulfillment service subscribes with a filter for order_placed only. The billing service subscribes with a filter for payment_failed. Neither service sees the other’s messages.

Filter policies evaluate message attributes — key-value pairs attached to the SNS message. Conditions can match exact values, a list of values, numeric ranges, or the presence or absence of an attribute. A message is delivered to a subscriber only if its attributes match the subscriber’s filter policy.
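
A sketch of the order-topic example above: the fulfillment queue subscribes with a filter so only order_placed events reach it, and the publisher attaches the attribute the filter evaluates (ARNs are placeholders).

```python
import json
import boto3

sns = boto3.client("sns")
TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:orders"  # placeholder

# Fulfillment service: only order_placed events are delivered to its queue.
sns.subscribe(
    TopicArn=TOPIC_ARN,
    Protocol="sqs",
    Endpoint="arn:aws:sqs:us-east-1:123456789012:fulfillment",  # placeholder
    Attributes={"FilterPolicy": json.dumps({"event_type": ["order_placed"]})},
)

# Publisher: message attributes are what the filter policy matches against.
sns.publish(
    TopicArn=TOPIC_ARN,
    Message=json.dumps({"order_id": "42"}),
    MessageAttributes={
        "event_type": {"DataType": "String", "StringValue": "order_placed"}
    },
)
```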

Fan-Out Pattern

The most common SNS + SQS pattern combines both services:

  1. A single event is published to an SNS topic.
  2. SNS delivers the event to multiple SQS queues simultaneously.
  3. Each queue has its own consumer service that processes the event independently.

This achieves true fan-out: multiple downstream systems process the same event without knowing about each other. Each queue provides buffering and retry semantics independently. If one downstream service is slow, its queue absorbs the backlog without affecting others.
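
A sketch of the wiring, assuming the topic and queues already exist: each queue subscribes to the topic, and each queue's policy must allow SNS to deliver to it (all names are placeholders).

```python
import json
import boto3

sns = boto3.client("sns")
sqs = boto3.client("sqs")
TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:uploads"  # placeholder

for queue_url in [
    "https://sqs.us-east-1.amazonaws.com/123456789012/thumbnails",  # placeholder
    "https://sqs.us-east-1.amazonaws.com/123456789012/metadata",    # placeholder
]:
    queue_arn = sqs.get_queue_attributes(
        QueueUrl=queue_url, AttributeNames=["QueueArn"]
    )["Attributes"]["QueueArn"]

    # The queue's resource policy must permit this topic to send to it.
    sqs.set_queue_attributes(
        QueueUrl=queue_url,
        Attributes={"Policy": json.dumps({
            "Version": "2012-10-17",
            "Statement": [{
                "Effect": "Allow",
                "Principal": {"Service": "sns.amazonaws.com"},
                "Action": "sqs:SendMessage",
                "Resource": queue_arn,
                "Condition": {"ArnEquals": {"aws:SourceArn": TOPIC_ARN}},
            }],
        })},
    )
    sns.subscribe(TopicArn=TOPIC_ARN, Protocol="sqs", Endpoint=queue_arn)

# One publish; every subscribed queue receives its own copy.
sns.publish(TopicArn=TOPIC_ARN, Message=json.dumps({"upload_id": "abc123"}))
```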

FIFO SNS Topics

SNS FIFO topics provide ordered, deduplicated delivery — but they can only deliver to SQS FIFO queues. Use SNS FIFO when you need fan-out with strict ordering. The ordering and deduplication guarantees propagate from the SNS topic through to the SQS FIFO queues.


Amazon Kinesis

Kinesis is a platform for real-time data streaming at scale. Where SQS and SNS are message-oriented — individual records consumed and deleted — Kinesis is stream-oriented: records are retained for a configurable period, and multiple consumers can read the same data independently.

Kinesis Data Streams

Data producers (applications, IoT devices, log agents, clickstream collectors) write records to a Kinesis Data Stream. The stream is divided into shards — the unit of capacity.

Each shard supports:

- Ingest: up to 1 MB/second or 1,000 records/second, whichever limit is reached first.
- Egress: up to 2 MB/second, shared across classic consumers of the shard.

Scale by adding shards. A stream with 10 shards handles 10 MB/s ingest. Records are ordered within a shard. A partition key determines which shard receives a record — records with the same partition key always land on the same shard. Use a high-cardinality partition key (user ID, session ID, device ID) to distribute load evenly across shards.
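
A producer sketch: using the user ID as the partition key keeps one user's events ordered on a single shard while spreading users across shards (the stream name is a placeholder).

```python
import json
import boto3

kinesis = boto3.client("kinesis")

def emit_click(user_id: str, page: str) -> None:
    kinesis.put_record(
        StreamName="clickstream",  # placeholder
        Data=json.dumps({"user": user_id, "page": page}).encode(),
        PartitionKey=user_id,  # high-cardinality key -> even shard distribution
    )
```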

Data retention: 24 hours by default, extendable to up to 365 days. During the retention window, any consumer can read any record, replay from any position, and progress at its own speed independently of other consumers.

Shard Consumers: Classic vs Enhanced Fan-Out

| Mode | Description | Throughput |
| --- | --- | --- |
| Classic (GetRecords) | Consumer polls the shard. The 2 MB/s egress limit is shared across all consumers of that shard. Five consumers on one shard each get approximately 400 KB/s. | Shared 2 MB/s per shard |
| Enhanced Fan-Out | Each registered consumer receives a dedicated 2 MB/s throughput per shard. Data is pushed via HTTP/2. Eliminates consumer competition for bandwidth. | Dedicated 2 MB/s per consumer per shard |

Enhanced Fan-Out is recommended when multiple independent services consume the same stream — for example, a real-time analytics pipeline, a monitoring pipeline, and an archival pipeline all reading the same clickstream simultaneously at full speed.
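
A sketch of registering a consumer and subscribing to one shard (ARNs are placeholders; the consumer must reach ACTIVE status before subscribing, and each subscription connection lasts at most five minutes, after which the consumer resubscribes).

```python
import boto3

kinesis = boto3.client("kinesis")
STREAM_ARN = "arn:aws:kinesis:us-east-1:123456789012:stream/clickstream"  # placeholder

# One-time registration: this consumer gets its own 2 MB/s per shard.
consumer = kinesis.register_stream_consumer(
    StreamARN=STREAM_ARN, ConsumerName="analytics"
)["Consumer"]

# (Wait until the consumer is ACTIVE before calling subscribe_to_shard.)
# Records are then pushed over HTTP/2 as an event stream.
resp = kinesis.subscribe_to_shard(
    ConsumerARN=consumer["ConsumerARN"],
    ShardId="shardId-000000000000",
    StartingPosition={"Type": "LATEST"},
)
for event in resp["EventStream"]:
    for record in event["SubscribeToShardEvent"]["Records"]:
        handle(record["Data"])  # hypothetical handler
```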

Kinesis Data Firehose

Firehose is a fully managed delivery service that reads from Kinesis Data Streams (or directly from producers) and delivers data to storage and analytics destinations:

- Amazon S3
- Amazon Redshift
- Amazon OpenSearch Service
- Splunk and generic HTTP endpoints

Firehose handles batching, compression, error handling, and retry automatically. No consumer code required. Configure the destination and buffer settings and Firehose does the rest.

Lambda transformation: Attach a Lambda function to a Firehose delivery stream. Firehose passes batches of records to Lambda before delivery. Lambda can filter out unwanted records, parse formats, enrich records by calling external APIs, or convert JSON to Parquet. Records that Lambda marks as failed go to an S3 error bucket.
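
The transformation function follows a fixed contract: each input record carries a recordId and base64-encoded data, and each output record must echo the recordId and set result to Ok, Dropped, or ProcessingFailed. A sketch that drops debug-level log records (the level field and enrichment are illustrative):

```python
import base64
import json

def handler(event, context):
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))
        if payload.get("level") == "debug":
            # Marked Dropped: Firehose will not deliver this record.
            output.append({"recordId": record["recordId"], "result": "Dropped"})
            continue
        payload["processed"] = True  # illustrative enrichment
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode(json.dumps(payload).encode()).decode(),
        })
    return {"records": output}
```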

Kinesis Data Analytics

Run SQL queries or Apache Flink applications against streaming data in real time:

- SQL applications: continuous queries over the stream, including windowed aggregations, filters, and joins.
- Apache Flink applications: full programmatic stream processing for more complex logic.

Use cases: detect payment anomalies from a transaction stream, compute 60-second rolling averages of IoT sensor readings, join a clickstream with a product catalog stream, filter error events from application logs in real time.

SQS vs Kinesis

| Dimension | SQS | Kinesis Data Streams |
| --- | --- | --- |
| Delivery model | Queue — message processed by one consumer, then deleted | Stream — records retained, multiple independent readers |
| Replay | No — DLQ captures failures but no replay of normal messages | Yes — replay from any position within retention window |
| Ordering | Standard: best-effort. FIFO: strict within group | Strict ordering within shard |
| Multiple consumers | No — each message goes to one consumer | Yes — multiple consumers read same records independently |
| Retention | Up to 14 days | Up to 365 days |
| Throughput | Unlimited (Standard) | Bounded by shard count; scale shards to grow |
| Best for | Task queues, job distribution, decoupling | Real-time analytics, event sourcing, multi-consumer pipelines |

Amazon EventBridge

EventBridge is a serverless event bus that routes events from sources to targets based on rules. It is the central nervous system for event-driven architectures on AWS — connecting AWS services, custom applications, and SaaS partners without writing polling or integration code.

Event Buses

| Bus Type | Description |
| --- | --- |
| Default bus | Receives events from AWS services automatically. EC2 state changes, S3 object events when configured, CloudTrail API calls, CodePipeline stage transitions, and more. |
| Custom bus | Receives events from your application code via the PutEvents API. Create one bus per application domain or microservice boundary. |
| Partner bus | Receives events from SaaS partner services — Datadog, Zendesk, Shopify, Auth0, GitHub, PagerDuty, and others. Partners publish directly to your partner event bus. |
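
Publishing to a custom bus is a single PutEvents call. A sketch (the bus name, source, and event shape are placeholders):

```python
import json
import boto3

events = boto3.client("events")

events.put_events(Entries=[{
    "EventBusName": "orders",        # placeholder custom bus
    "Source": "com.example.orders",  # placeholder
    "DetailType": "OrderPlaced",
    "Detail": json.dumps({"order_id": "42", "total": 99.5}),
}])
```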

Rules

A rule has two parts: a filter pattern and one or more targets.

Event pattern: A JSON object that matches events by field values. Match on source, detail-type, specific keys within the detail block, account ID, or region. Supports exact match, prefix, suffix, wildcard, numeric ranges, and anything-but negation.

Targets: Up to five targets per rule. When an event matches the pattern, EventBridge invokes all targets simultaneously:

| Target | Notes |
| --- | --- |
| Lambda function | Most common. Invoked synchronously with the full event payload. |
| SQS queue | Queues the event for async processing. Absorbs traffic spikes. |
| SNS topic | Fan-out the event to multiple subscribers. |
| Step Functions state machine | Start a workflow execution with the event as input. |
| ECS task | Run a container task in response to the event. |
| API Gateway | HTTP POST to a REST or HTTP API endpoint. |
| Kinesis Data Stream or Firehose | Route events into a streaming pipeline. |
| Another EventBridge bus | Forward events across accounts or organizational units. |

Input transformation: Before delivering to a target, EventBridge can transform the event payload. Extract specific fields, rename keys, add static values, or construct a new JSON document. This avoids the need for a Lambda adapter function just to reshape an event before passing it to a target.
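
A sketch tying the pieces together: a rule on the custom bus matches OrderPlaced events and delivers a reshaped payload to a Lambda target (names and ARNs are placeholders).

```python
import json
import boto3

events = boto3.client("events")

events.put_rule(
    Name="order-placed",
    EventBusName="orders",  # placeholder
    EventPattern=json.dumps({
        "source": ["com.example.orders"],
        "detail-type": ["OrderPlaced"],
        "detail": {"total": [{"numeric": [">", 0]}]},  # numeric-range condition
    }),
)

events.put_targets(
    Rule="order-placed",
    EventBusName="orders",
    Targets=[{
        "Id": "fulfillment-fn",
        "Arn": "arn:aws:lambda:us-east-1:123456789012:function:fulfill",  # placeholder
        # Reshape the event before delivery; no adapter Lambda needed.
        # (The function's resource policy must allow events.amazonaws.com to invoke it.)
        "InputTransformer": {
            "InputPathsMap": {"id": "$.detail.order_id"},
            "InputTemplate": '{"orderId": <id>}',
        },
    }],
)
```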

Scheduled Rules

EventBridge supports scheduled invocation using:

- Rate expressions: rate(5 minutes), rate(1 hour), rate(7 days).
- Cron expressions: cron(0 12 * * ? *) fires at 12:00 UTC every day.

Scheduled rules replace traditional cron jobs on servers. Invoke a Lambda function to clean up stale records, trigger a Step Functions workflow for nightly batch processing, or run an ECS task to generate reports. No server or cron daemon required.
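
A sketch of a scheduled rule invoking a cleanup Lambda every hour (the function ARN is a placeholder; scheduled rules live on the default bus):

```python
import boto3

events = boto3.client("events")

events.put_rule(
    Name="hourly-cleanup",
    ScheduleExpression="rate(1 hour)",  # or cron(0 12 * * ? *) for noon UTC daily
)
events.put_targets(
    Rule="hourly-cleanup",
    Targets=[{
        "Id": "cleanup-fn",
        "Arn": "arn:aws:lambda:us-east-1:123456789012:function:cleanup",  # placeholder
    }],
)
```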

Archive and Replay

EventBridge can archive events flowing through a bus — all events, or a filtered subset matching a pattern. Archives are stored in S3-backed storage managed by EventBridge. Configure an optional retention period.

Replay: Replay archived events back through the bus at any time. All rules evaluate replayed events as if they were new and invoke targets accordingly. Use replay to debug a new rule by testing it against historical events, reprocess events after a consumer bug is fixed, or populate a new downstream system with historical data.
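
A sketch of both halves, assuming a bus named orders (the ARNs, names, and time window are placeholders):

```python
from datetime import datetime, timezone
import boto3

events = boto3.client("events")
BUS_ARN = "arn:aws:events:us-east-1:123456789012:event-bus/orders"  # placeholder

# Archive everything flowing through the bus, kept for 30 days.
archive = events.create_archive(
    ArchiveName="orders-archive",
    EventSourceArn=BUS_ARN,
    RetentionDays=30,
)

# Later: replay a window of archived events back through the bus.
events.start_replay(
    ReplayName="orders-replay-1",
    EventSourceArn=archive["ArchiveArn"],
    EventStartTime=datetime(2024, 1, 1, tzinfo=timezone.utc),
    EventEndTime=datetime(2024, 1, 2, tzinfo=timezone.utc),
    Destination={"Arn": BUS_ARN},
)
```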

Schema Registry

As events flow through EventBridge, the schema registry discovers and records their structure automatically. Schemas describe the detail field — field names, types, and nested objects.

From discovered schemas, EventBridge generates code bindings: strongly-typed classes and deserialization code for TypeScript, Python, Java, and Go. Your Lambda or application code receives a typed object instead of a raw JSON map, with IDE autocompletion for event fields.


Comparing the Services

| Scenario | Recommended Service |
| --- | --- |
| Decouple two services, one consumer per message | SQS Standard Queue |
| Ordered processing, exactly-once, transactional | SQS FIFO Queue |
| Broadcast one event to multiple services simultaneously | SNS Topic |
| Fan-out with per-subscriber message filtering | SNS with filter policies |
| Real-time stream, multiple independent consumers, replay | Kinesis Data Streams |
| Deliver stream data to S3 or Redshift, no consumer code | Kinesis Data Firehose |
| Real-time stream SQL or Flink analytics | Kinesis Data Analytics |
| React to AWS service events, route to multiple targets | EventBridge |
| Schedule periodic invocations, replace cron jobs | EventBridge Scheduled Rules |
| Replay historical events for debugging or reprocessing | EventBridge Archive and Replay |
| Orchestrate multi-step workflows with retries and branches | Step Functions |

Fan-Out Pattern: File Upload Pipeline

A common real-world composition: a user uploads a file to S3. Multiple downstream systems must react — thumbnail generation, metadata extraction, and a notification service. Each system is independent and scales separately.

  1. A user uploads a file to S3 (PUT /uploads/photo.jpg); the upload completes.
  2. S3 publishes an s3:ObjectCreated event notification to an SNS topic.
  3. SNS fans out: one copy of the event is delivered to Queue A (thumbnail service), one copy to Queue B (metadata extractor), and the notification Lambda is invoked directly.
  4. The thumbnail worker polls Queue A, resizes the image, and writes thumbnails to S3.
  5. The metadata worker polls Queue B, extracts EXIF data, and writes it to DynamoDB.
  6. The notification Lambda publishes to the user via SNS SMS or email.

Each service processes the same upload event independently. Queue A and Queue B provide buffering — if the thumbnail worker is temporarily overwhelmed, messages accumulate in the queue and are processed when capacity returns, without affecting the metadata extractor or the notification Lambda. If either consumer fails repeatedly, messages move to a DLQ for inspection without blocking the others.
