Overview
In a tightly coupled architecture, service A calls service B directly and synchronously. If B is slow, A waits. If B is down, A fails. If B cannot keep up with A’s request rate, A backs off or drops requests. Every dependent service becomes a potential failure propagation path.
Messaging services break these dependencies. A producer writes a message and moves on — it does not wait for, or even know about, the consumer. A consumer reads messages at its own pace — it does not need to be running when the producer sends. The two components operate independently: they can be scaled separately, deployed independently, fail without cascading, and be replaced without the other side noticing.
AWS offers a suite of messaging primitives that cover different patterns: SQS for point-to-point queuing, SNS for fan-out pub/sub, Kinesis for high-throughput real-time streaming, and EventBridge for event routing across AWS services, custom applications, and SaaS partners. These are not competing services — they compose. Understanding which primitive fits which problem is the foundation of event-driven architecture on AWS.
Amazon SQS — Simple Queue Service
SQS is a fully managed message queue. A producer sends messages to a queue. One or more consumers poll the queue, retrieve messages, process them, and delete them. SQS stores messages durably across multiple AZs until they are explicitly deleted.
Standard Queue
The Standard Queue is the default SQS type:
- At-least-once delivery: AWS guarantees every message will be delivered at least once, but a message may occasionally be delivered more than once. Consumers must be idempotent — processing a duplicate must produce the same result as processing it once.
- Best-effort ordering: Messages are generally delivered in the order they were sent, but order is not guaranteed. High-throughput distributed systems occasionally deliver messages out of sequence.
- Nearly unlimited throughput: Standard queues scale to any transaction rate. There is no throughput cap.
- Use cases: Task queues, job distribution, decoupling microservices where duplicate-safe processing is straightforward.
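The at-least-once contract above means consumer code must tolerate duplicates. A minimal Python sketch of an idempotent handler, keying the duplicate check on the SQS MessageId (in production the seen-ID store would be durable, e.g. DynamoDB, not an in-memory set):

```python
import json

processed_ids = set()  # stands in for a durable idempotency store (e.g. DynamoDB)

def handle(message_id: str, body: str) -> bool:
    """Process a message at most once per MessageId; return True if work ran."""
    if message_id in processed_ids:
        return False  # duplicate delivery: safe no-op
    payload = json.loads(body)
    # ... do the real work with `payload` here ...
    processed_ids.add(message_id)
    return True
```

Because a redelivered message short-circuits on the ID check, processing it twice produces the same result as processing it once.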
FIFO Queue
FIFO queues provide stronger guarantees at the cost of lower maximum throughput:
- Exactly-once processing: The queue deduplicates messages using a deduplication ID (content-based or explicit). A message sent twice within the deduplication window (5 minutes) is delivered once.
- Strict ordering: Messages are delivered in the exact order they were sent.
- Throughput limits: 300 transactions per second (TPS) without batching. 3,000 TPS with batching (up to 10 messages per API call). Higher throughput requires High Throughput FIFO mode (up to 30,000 TPS).
- Message Group ID: Groups messages into ordered sequences. Within a group, ordering is strict. Different groups can be processed in parallel by different consumers. Use message group ID to parallelize workloads while preserving order within logical entities — for example, all events for a given order ID always arrive in sequence.
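A sketch of building a FIFO send: the queue URL and event shape are illustrative, but MessageGroupId and MessageDeduplicationId are the actual SendMessage parameters described above.

```python
import hashlib
import json

def fifo_send_params(queue_url: str, order_id: str, event: dict) -> dict:
    """Build SendMessage kwargs for a FIFO queue (pass to boto3's sqs.send_message)."""
    body = json.dumps(event, sort_keys=True)
    return {
        "QueueUrl": queue_url,
        "MessageBody": body,
        "MessageGroupId": order_id,  # events for one order stay strictly ordered
        # Content-based ID: the same body sent twice within 5 minutes is dropped.
        "MessageDeduplicationId": hashlib.sha256(body.encode()).hexdigest(),
    }

# sqs.send_message(**fifo_send_params(queue_url, "order-42", {"type": "order_placed"}))
```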
Visibility Timeout
When a consumer reads a message with ReceiveMessage, the message is not deleted. It becomes invisible to all other consumers for the visibility timeout period. The consuming instance is expected to process the message and call DeleteMessage before the timeout expires.
If the consumer crashes, stalls, or fails to call DeleteMessage, the visibility timeout expires and the message becomes visible again — another consumer can pick it up. This prevents message loss when a consumer fails mid-processing.
Set the visibility timeout to be longer than your maximum expected processing time. If processing can take 30 seconds, set the timeout to 60 seconds. A consumer can call ChangeMessageVisibility to extend the timeout dynamically if processing takes longer than anticipated.
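The extend-then-delete flow can be sketched against the real ChangeMessageVisibility and DeleteMessage actions; the client is injected so the sketch stays generic, and the queue URL and task are placeholders:

```python
def process_long_task(sqs, queue_url: str, receipt_handle: str, task) -> None:
    """Run `task`, extending the visibility timeout first and deleting on success."""
    sqs.change_message_visibility(
        QueueUrl=queue_url,
        ReceiptHandle=receipt_handle,
        VisibilityTimeout=120,  # longer than the worst-case processing time
    )
    task()  # the actual work; an exception here skips the delete,
            # so the message reappears after the timeout expires
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=receipt_handle)
```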
Dead-Letter Queue (DLQ)
A DLQ is a separate SQS queue that receives messages that fail processing repeatedly. Configure a maxReceiveCount on the source queue — if a message is received more than this many times without being deleted, SQS moves it to the DLQ automatically.
| Purpose | Detail |
|---|---|
| Debugging | Inspect failed messages without losing them. Examine why processing failed. |
| Alerting | Set a CloudWatch alarm on the DLQ ApproximateNumberOfMessagesVisible metric. |
| Reprocessing | After fixing the bug, replay messages from DLQ back to the source queue via the console or CLI. |
| Isolation | Failed messages do not block other messages from being processed. |
DLQs apply to both Standard and FIFO queues. A FIFO DLQ must be used with a FIFO source queue.
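Wiring a DLQ is a single queue attribute on the source queue. A sketch of the redrive policy (the ARN is a placeholder; RedrivePolicy, deadLetterTargetArn, and maxReceiveCount are the actual attribute and field names):

```python
import json

def redrive_attributes(dlq_arn: str, max_receives: int = 5) -> dict:
    """Attributes for sqs.set_queue_attributes on the source queue."""
    return {
        "RedrivePolicy": json.dumps({
            "deadLetterTargetArn": dlq_arn,        # where failed messages land
            "maxReceiveCount": str(max_receives),  # receives before giving up
        })
    }

# sqs.set_queue_attributes(
#     QueueUrl=source_queue_url,
#     Attributes=redrive_attributes("arn:aws:sqs:us-east-1:123456789012:orders-dlq"))
```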
Long Polling
By default, ReceiveMessage returns immediately — even if the queue is empty. This short polling generates many empty API responses, increasing cost and CPU overhead on the consumer.
Long polling instructs SQS to wait up to 20 seconds (WaitTimeSeconds) for a message to arrive before returning. If a message arrives during the wait, it is returned immediately. If the timeout expires with no message, an empty response is returned. Long polling reduces empty responses by 90%+ and lowers SQS cost. Enable it on the queue via ReceiveMessageWaitTimeSeconds or per individual request.
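A long-polling receive in one call, sketched with the client injected (the queue URL is hypothetical); WaitTimeSeconds and MaxNumberOfMessages are the real ReceiveMessage parameters:

```python
def receive_batch(sqs, queue_url: str) -> list:
    """One long-poll cycle: waits up to 20 s instead of returning empty immediately."""
    resp = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=10,  # up to 10 messages per call
        WaitTimeSeconds=20,      # maximum long-poll wait
    )
    return resp.get("Messages", [])
```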
Key Limits
| Attribute | Standard Queue | FIFO Queue |
|---|---|---|
| Message retention | 4 days default, up to 14 days | Same |
| Max message size | 256 KB | Same |
| Max long poll wait | 20 seconds | Same |
| Max visibility timeout | 12 hours | Same |
| Inflight messages | 120,000 | 20,000 |
For payloads larger than 256 KB, use the SQS Extended Client Library (Java/Python), which stores the message body in S3 and puts only the S3 reference in the SQS message. The consumer retrieves the reference, fetches from S3, and processes the full payload.
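The claim-check pattern the Extended Client automates can be sketched by hand. The pointer format below is illustrative, not the library's actual wire format; only the idea (body in S3, reference in SQS) matches.

```python
import json

SQS_LIMIT = 256 * 1024  # bytes

def offload_if_large(s3, bucket: str, key: str, body: bytes) -> str:
    """Return the SQS message body: the payload itself, or an S3 pointer to it."""
    if len(body) <= SQS_LIMIT:
        return body.decode()
    s3.put_object(Bucket=bucket, Key=key, Body=body)       # store the real payload
    return json.dumps({"s3Bucket": bucket, "s3Key": key})  # send only the pointer
```

The consumer checks for the pointer shape, fetches the object with get_object, and processes the full payload.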
Amazon SNS — Simple Notification Service
SNS is a fully managed pub/sub messaging service. A publisher sends a message to an SNS topic. SNS immediately delivers the message to all subscribed endpoints simultaneously. The publisher has no knowledge of who the subscribers are or how many there are.
Subscription Types
| Subscription Protocol | Use Case |
|---|---|
| SQS | Fan-out to queue for async processing. Queue absorbs traffic spikes. |
| Lambda | Invoke Lambda function directly. |
| HTTP/HTTPS | POST message to a web endpoint. Requires subscription confirmation. |
| Email / Email-JSON | Human notification. Email-JSON delivers raw JSON payload. |
| SMS | Text message to a phone number. |
| Kinesis Data Firehose | Route events to S3, Redshift, or OpenSearch via Firehose. |
| Mobile Push | Apple Push Notification Service (APNS), Firebase Cloud Messaging (FCM/GCM). |
Message Filtering
Without filtering, every subscriber receives every message published to the topic. Subscription filter policies allow each subscriber to declare which messages it wants. A filter policy is a JSON object specifying attribute conditions.
Example: An e-commerce order topic receives events of type order_placed, payment_failed, and order_shipped. The fulfillment service subscribes with a filter for order_placed only. The billing service subscribes with a filter for payment_failed. Neither service sees the other’s messages.
Filter policies evaluate message attributes — key-value pairs attached to the SNS message. Conditions can match exact values, a list of values, numeric ranges, or the presence or absence of an attribute. A message is delivered to a subscriber only if its attributes match the subscriber’s filter policy.
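The order-topic example above, sketched as a filtered subscription (the ARNs are placeholders; FilterPolicy is the real subscription attribute, passed as a JSON string):

```python
import json

def filtered_subscription(topic_arn: str, queue_arn: str, event_types: list) -> dict:
    """Kwargs for sns.subscribe: only matching event_type attributes are delivered."""
    return {
        "TopicArn": topic_arn,
        "Protocol": "sqs",
        "Endpoint": queue_arn,
        "Attributes": {
            "FilterPolicy": json.dumps({"event_type": event_types}),
        },
    }

# Fulfillment sees only order_placed; billing would subscribe with ["payment_failed"].
# sns.subscribe(**filtered_subscription(topic_arn, fulfillment_queue_arn, ["order_placed"]))
```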
Fan-Out Pattern
The most common SNS + SQS pattern combines both services:
- A single event is published to an SNS topic.
- SNS delivers the event to multiple SQS queues simultaneously.
- Each queue has its own consumer service that processes the event independently.
This achieves true fan-out: multiple downstream systems process the same event without knowing about each other. Each queue provides buffering and retry semantics independently. If one downstream service is slow, its queue absorbs the backlog without affecting others.
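On the publishing side of the fan-out, one call reaches every subscribed queue. A sketch of the publish parameters (the topic ARN and attribute names are illustrative); the message attribute is what subscription filter policies match on:

```python
import json

def publish_params(topic_arn: str, event_type: str, detail: dict) -> dict:
    """Kwargs for sns.publish: one event, delivered to every subscribed queue."""
    return {
        "TopicArn": topic_arn,
        "Message": json.dumps(detail),
        "MessageAttributes": {  # attributes are what filter policies evaluate
            "event_type": {"DataType": "String", "StringValue": event_type},
        },
    }

# sns.publish(**publish_params(topic_arn, "order_placed", {"order_id": "o-42"}))
```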
FIFO SNS Topics
SNS FIFO topics provide ordered, deduplicated delivery — but they can only deliver to SQS FIFO queues. Use SNS FIFO when you need fan-out with strict ordering. The ordering and deduplication guarantees propagate from the SNS topic through to the SQS FIFO queues.
Amazon Kinesis
Kinesis is a platform for real-time data streaming at scale. Where SQS and SNS are message-oriented — individual records consumed and deleted — Kinesis is stream-oriented: records are retained for a configurable period, and multiple independent consumers can read the same data independently.
Kinesis Data Streams
Data producers (applications, IoT devices, log agents, clickstream collectors) write records to a Kinesis Data Stream. The stream is divided into shards — the unit of capacity.
Each shard supports:
- 1 MB/s or 1,000 records per second ingest, whichever limit is reached first
- 2 MB/s egress per consumer
Scale by adding shards. A stream with 10 shards handles 10 MB/s ingest. Records are ordered within a shard. A partition key determines which shard receives a record — records with the same partition key always land on the same shard. Use a high-cardinality partition key (user ID, session ID, device ID) to distribute load evenly across shards.
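A sketch of a single-record put (the stream name and event fields are illustrative); PartitionKey is the real parameter that pins a record to a shard:

```python
import json

def put_record_params(stream_name: str, user_id: str, event: dict) -> dict:
    """Kwargs for kinesis.put_record; the same user_id always hashes to the same shard."""
    return {
        "StreamName": stream_name,
        "Data": json.dumps(event).encode(),  # Kinesis records are opaque bytes
        "PartitionKey": user_id,             # high-cardinality key spreads load
    }

# kinesis.put_record(**put_record_params("clickstream", "user-8312", {"page": "/cart"}))
```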
Data retention: 24 hours by default, extendable to up to 365 days. During the retention window, any consumer can read any record, replay from any position, or read at multiple speeds independently.
Shard Consumers: Classic vs Enhanced Fan-Out
| Mode | Description | Throughput |
|---|---|---|
| Classic (GetRecords) | Consumer polls the shard. The 2 MB/s egress limit is shared across all consumers of that shard. Five consumers on one shard each get approximately 400 KB/s. | Shared 2 MB/s per shard |
| Enhanced Fan-Out | Each registered consumer receives a dedicated 2 MB/s throughput per shard. Data is pushed via HTTP/2. Eliminates consumer competition for bandwidth. | Dedicated 2 MB/s per consumer per shard |
Enhanced Fan-Out is recommended when multiple independent services consume the same stream — for example, a real-time analytics pipeline, a monitoring pipeline, and an archival pipeline all reading the same clickstream simultaneously at full speed.
Kinesis Data Firehose
Firehose is a fully managed delivery service that reads from Kinesis Data Streams (or directly from producers) and delivers data to storage and analytics destinations:
- Amazon S3: Buffer by size (1–128 MB) and time (60–900 seconds), then write as objects. Supports Snappy, GZIP, and ZIP compression.
- Amazon Redshift: Deliver to S3 first, then issue a Redshift COPY command automatically.
- Amazon OpenSearch Service: Index documents in real time.
- Splunk, Datadog, MongoDB, HTTP endpoints: Third-party SaaS delivery.
Firehose handles batching, compression, error handling, and retry automatically. No consumer code required. Configure the destination and buffer settings and Firehose does the rest.
Lambda transformation: Attach a Lambda function to a Firehose delivery stream. Firehose passes batches of records to Lambda before delivery. Lambda can filter out unwanted records, parse formats, enrich records by calling external APIs, or convert JSON to Parquet. Records that Lambda marks as failed go to an S3 error bucket.
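A sketch of a transformation handler. The recordId / result / base64 data shape follows Firehose's data-transformation contract; the filtering rule itself (dropping DEBUG logs) is illustrative.

```python
import base64
import json

def handler(event, context):
    """Firehose transformation Lambda: drop DEBUG records, enrich the rest."""
    out = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))
        if payload.get("level") == "DEBUG":
            out.append({"recordId": record["recordId"], "result": "Dropped"})
            continue
        payload["processed"] = True  # example enrichment
        data = base64.b64encode(json.dumps(payload).encode()).decode()
        out.append({"recordId": record["recordId"], "result": "Ok", "data": data})
    return {"records": out}
```

Records returned with result ProcessingFailed are the ones Firehose routes to the S3 error bucket.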
Kinesis Data Analytics
Run SQL queries or Apache Flink applications against streaming data in real time:
- SQL mode: Write standard SQL against a streaming input. Define tumbling windows (fixed time), sliding windows, or session windows. Output results to Kinesis Data Streams or Firehose.
- Apache Flink mode: Full Flink runtime managed by AWS. Complex stateful processing, joins across streams, exactly-once semantics, and machine learning inference.
Use cases: detect payment anomalies from a transaction stream, compute 60-second rolling averages of IoT sensor readings, join a clickstream with a product catalog stream, filter error events from application logs in real time.
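The windowing semantics are easier to see in code than in prose. This is not Kinesis Data Analytics code, just a plain-Python illustration of a tumbling window: fixed, non-overlapping time buckets, each aggregated independently.

```python
from collections import defaultdict

def tumbling_average(events, window_seconds=60):
    """events: iterable of (timestamp_seconds, value) pairs.
    Returns {window_start: average} over fixed, non-overlapping windows."""
    buckets = defaultdict(list)
    for ts, value in events:
        window_start = ts - (ts % window_seconds)  # floor to the window boundary
        buckets[window_start].append(value)
    return {start: sum(vals) / len(vals) for start, vals in buckets.items()}
```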
SQS vs Kinesis
| Dimension | SQS | Kinesis Data Streams |
|---|---|---|
| Delivery model | Queue — message processed by one consumer, then deleted | Stream — records retained, multiple independent readers |
| Replay | No — DLQ captures failures but no replay of normal messages | Yes — replay from any position within retention window |
| Ordering | Standard: best-effort. FIFO: strict within group | Strict ordering within shard |
| Multiple consumers | No — each message goes to one consumer | Yes — multiple consumers read same records independently |
| Retention | Up to 14 days | Up to 365 days |
| Throughput | Unlimited (Standard) | Bounded by shard count; scale shards to grow |
| Best for | Task queues, job distribution, decoupling | Real-time analytics, event sourcing, multi-consumer pipelines |
Amazon EventBridge
EventBridge is a serverless event bus that routes events from sources to targets based on rules. It is the central nervous system for event-driven architectures on AWS — connecting AWS services, custom applications, and SaaS partners without writing polling or integration code.
Event Buses
| Bus Type | Description |
|---|---|
| Default bus | Receives events from AWS services automatically. EC2 state changes, S3 object events when configured, CloudTrail API calls, CodePipeline stage transitions, and more. |
| Custom bus | Receives events from your application code via the PutEvents API. Create one bus per application domain or microservice boundary. |
| Partner bus | Receives events from SaaS partner services — Datadog, Zendesk, Shopify, Auth0, GitHub, PagerDuty, and others. Partners publish directly to your partner event bus. |
Rules
A rule has two parts: a filter pattern and one or more targets.
Event pattern: A JSON object that matches events by field values. Match on source, detail-type, specific keys within the detail block, account ID, or region. Supports exact match, prefix, suffix, wildcard, numeric ranges, and anything-but negation.
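A sketch of an event pattern (the source name and fields are hypothetical); numeric conditions use the bracketed operator form shown:

```python
import json

# Match custom OrderPlaced events whose total exceeds 100.
ORDER_PATTERN = {
    "source": ["com.example.orders"],                # hypothetical custom source
    "detail-type": ["OrderPlaced"],
    "detail": {"total": [{"numeric": [">", 100]}]},  # numeric range condition
}

# events.put_rule(Name="large-orders", EventBusName="orders",
#                 EventPattern=json.dumps(ORDER_PATTERN))
```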
Targets: Up to five targets per rule. When an event matches the pattern, EventBridge invokes all targets simultaneously:
| Target | Notes |
|---|---|
| Lambda function | Most common. Invoked asynchronously with the full event payload. |
| SQS queue | Queues the event for async processing. Absorbs traffic spikes. |
| SNS topic | Fan-out the event to multiple subscribers. |
| Step Functions state machine | Start a workflow execution with the event as input. |
| ECS task | Run a container task in response to the event. |
| API Gateway | HTTP POST to a REST or HTTP API endpoint. |
| Kinesis Data Stream or Firehose | Route events into a streaming pipeline. |
| Another EventBridge bus | Forward events across accounts or organizational units. |
Input transformation: Before delivering to a target, EventBridge can transform the event payload. Extract specific fields, rename keys, add static values, or construct a new JSON document. This avoids the need for a Lambda adapter function just to reshape an event before passing it to a target.
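A sketch of an input transformer attached to a target (the field names are illustrative): InputPathsMap pulls values out of the event, and InputTemplate assembles the reshaped payload.

```python
def reshaped_target(target_id: str, lambda_arn: str) -> dict:
    """A put_targets entry that delivers a trimmed payload instead of the raw event."""
    return {
        "Id": target_id,
        "Arn": lambda_arn,
        "InputTransformer": {
            "InputPathsMap": {  # extract fields from the matched event
                "id": "$.detail.orderId",
                "total": "$.detail.total",
            },
            "InputTemplate": '{"order": "<id>", "amount": <total>}',
        },
    }

# events.put_targets(Rule="large-orders", Targets=[reshaped_target("t1", lambda_arn)])
```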
Scheduled Rules
EventBridge supports scheduled invocation using:
- Rate expressions: rate(5 minutes), rate(1 hour), rate(1 day)
- Cron expressions: cron(0 12 * * ? *) — every day at noon UTC
Scheduled rules replace traditional cron jobs on servers. Invoke a Lambda function to clean up stale records, trigger a Step Functions workflow for nightly batch processing, or run an ECS task to generate reports. No server or cron daemon required.
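A sketch of a scheduled rule replacing a nightly cron job (the rule name and target are placeholders); ScheduleExpression accepts either a rate or a cron expression:

```python
def nightly_rule() -> dict:
    """Kwargs for events.put_rule; attach the Lambda afterwards with put_targets."""
    return {
        "Name": "nightly-cleanup",                  # hypothetical rule name
        "ScheduleExpression": "cron(0 3 * * ? *)",  # 03:00 UTC every day
    }

# events.put_rule(**nightly_rule())
# events.put_targets(Rule="nightly-cleanup",
#                    Targets=[{"Id": "cleanup", "Arn": cleanup_lambda_arn}])
```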
Archive and Replay
EventBridge can archive events flowing through a bus — all events, or a filtered subset matching a pattern. Archives are stored in S3-backed storage managed by EventBridge. Configure an optional retention period.
Replay: Replay archived events back through the bus at any time. All rules evaluate replayed events as if they were new and invoke targets accordingly. Use replay to debug a new rule by testing it against historical events, reprocess events after a consumer bug is fixed, or populate a new downstream system with historical data.
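A sketch of kicking off a replay (the ARNs and time window are placeholders); EventSourceArn points at the archive, and Destination names the bus the events re-enter:

```python
from datetime import datetime, timezone

def replay_params(archive_arn: str, bus_arn: str,
                  start: datetime, end: datetime) -> dict:
    """Kwargs for events.start_replay: push archived events back through the bus."""
    return {
        "ReplayName": "reprocess-after-fix",  # hypothetical replay name
        "EventSourceArn": archive_arn,        # the archive, not the live bus
        "EventStartTime": start,
        "EventEndTime": end,
        "Destination": {"Arn": bus_arn},      # rules on this bus re-evaluate
    }

# events.start_replay(**replay_params(
#     archive_arn, bus_arn,
#     datetime(2024, 1, 1, tzinfo=timezone.utc),
#     datetime(2024, 1, 2, tzinfo=timezone.utc)))
```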
Schema Registry
As events flow through EventBridge, the schema registry discovers and records their structure automatically. Schemas describe the detail field — field names, types, and nested objects.
From discovered schemas, EventBridge generates code bindings: strongly-typed classes and deserialization code for TypeScript, Python, Java, and Go. Your Lambda or application code receives a typed object instead of a raw JSON map, with IDE autocompletion for event fields.
Comparing the Services
| Scenario | Recommended Service |
|---|---|
| Decouple two services, one consumer per message | SQS Standard Queue |
| Ordered processing, exactly-once, transactional | SQS FIFO Queue |
| Broadcast one event to multiple services simultaneously | SNS Topic |
| Fan-out with per-subscriber message filtering | SNS with filter policies |
| Real-time stream, multiple independent consumers, replay | Kinesis Data Streams |
| Deliver stream data to S3 or Redshift, no consumer code | Kinesis Data Firehose |
| Real-time stream SQL or Flink analytics | Kinesis Data Analytics |
| React to AWS service events, route to multiple targets | EventBridge |
| Schedule periodic invocations, replace cron jobs | EventBridge Scheduled Rules |
| Replay historical events for debugging or reprocessing | EventBridge Archive and Replay |
| Orchestrate multi-step workflows with retries and branches | Step Functions |
Fan-Out Pattern: File Upload Pipeline
A common real-world composition: a user uploads a file to S3. Multiple downstream systems must react — thumbnail generation, metadata extraction, and a notification service. Each system is independent and scales separately. S3 publishes the ObjectCreated event to an SNS topic; the topic fans out to two SQS queues (Queue A for the thumbnail worker, Queue B for the metadata extractor) and invokes a notification Lambda directly.
Each service processes the same upload event independently. Queue A and Queue B provide buffering — if the thumbnail worker is temporarily overwhelmed, messages accumulate in the queue and are processed when capacity returns, without affecting the metadata extractor or the notification Lambda. If either consumer fails repeatedly, messages move to a DLQ for inspection without blocking the others.