Overview
In a tightly coupled architecture, service A calls service B directly and synchronously. If B is slow, A waits. If B is down, A fails. If B cannot keep up with A’s request rate, A backs off or drops requests. Every dependent service becomes a potential failure propagation path.
Messaging services break these dependencies. A producer writes a message and moves on — it does not wait for, or even know about, the consumer. A consumer reads messages at its own pace — it does not need to be running when the producer sends. The two components operate independently: they can be scaled separately, deployed independently, fail without cascading, and be replaced without the other side noticing.
AWS offers a suite of messaging primitives that cover different patterns: SQS for point-to-point queuing, SNS for fan-out pub/sub, Kinesis for high-throughput real-time streaming, and EventBridge for event routing across AWS services, custom applications, and SaaS partners. These are not competing services — they compose. Understanding which primitive fits which problem is the foundation of event-driven architecture on AWS.
Amazon SQS — Simple Queue Service
SQS is a fully managed message queue. A producer sends messages to a queue. One or more consumers poll the queue, retrieve messages, process them, and delete them. SQS stores messages durably across multiple AZs until they are explicitly deleted.
Standard Queue
The Standard Queue is the default SQS type:
- At-least-once delivery: AWS guarantees every message will be delivered at least once, but a message may occasionally be delivered more than once. Consumers must be idempotent — processing a duplicate must produce the same result as processing it once.
- Best-effort ordering: Messages are generally delivered in the order they were sent, but order is not guaranteed. High-throughput distributed systems occasionally deliver messages out of sequence.
- Nearly unlimited throughput: Standard queues scale to any transaction rate. There is no throughput cap.
- Use cases: Task queues, job distribution, decoupling microservices where duplicate-safe processing is straightforward.
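The at-least-once contract above means consumer code must tolerate duplicates. A minimal Python sketch of an idempotent handler, keying the duplicate check on the SQS MessageId (in production the seen-ID store would be durable, e.g. DynamoDB, not an in-memory set):

```python
import json

processed_ids = set()  # stands in for a durable idempotency store (e.g. DynamoDB)

def handle(message_id: str, body: str) -> bool:
    """Process a message at most once per MessageId; return True if work ran."""
    if message_id in processed_ids:
        return False  # duplicate delivery: safe no-op
    payload = json.loads(body)
    # ... do the real work with `payload` here ...
    processed_ids.add(message_id)
    return True
```

Because a redelivered message short-circuits on the ID check, processing it twice produces the same result as processing it once.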
FIFO Queue
FIFO queues provide stronger guarantees at the cost of lower maximum throughput:
- Exactly-once processing: The queue deduplicates messages using a deduplication ID (content-based or explicit). A message sent twice within the deduplication window (5 minutes) is delivered once.
- Strict ordering: Messages are delivered in the exact order they were sent.
- Throughput limits: 300 transactions per second (TPS) without batching. 3,000 TPS with batching (up to 10 messages per API call). Higher throughput requires High Throughput FIFO mode (up to 30,000 TPS).
- Message Group ID: Groups messages into ordered sequences. Within a group, ordering is strict. Different groups can be processed in parallel by different consumers. Use message group ID to parallelize workloads while preserving order within logical entities — for example, all events for a given order ID always arrive in sequence.
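A sketch of building a FIFO send: the queue URL and event shape are illustrative, but MessageGroupId and MessageDeduplicationId are the actual SendMessage parameters described above.

```python
import hashlib
import json

def fifo_send_params(queue_url: str, order_id: str, event: dict) -> dict:
    """Build SendMessage kwargs for a FIFO queue (pass to boto3's sqs.send_message)."""
    body = json.dumps(event, sort_keys=True)
    return {
        "QueueUrl": queue_url,
        "MessageBody": body,
        "MessageGroupId": order_id,  # events for one order stay strictly ordered
        # Content-based ID: the same body sent twice within 5 minutes is dropped.
        "MessageDeduplicationId": hashlib.sha256(body.encode()).hexdigest(),
    }

# sqs.send_message(**fifo_send_params(queue_url, "order-42", {"type": "order_placed"}))
```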
Visibility Timeout
When a consumer reads a message with ReceiveMessage, the message is not deleted. It becomes invisible to all other consumers for the visibility timeout period. The consuming instance is expected to process the message and call DeleteMessage before the timeout expires.
If the consumer crashes, stalls, or fails to call DeleteMessage, the visibility timeout expires and the message becomes visible again — another consumer can pick it up. This prevents message loss when a consumer fails mid-processing.
Set the visibility timeout to be longer than your maximum expected processing time. If processing can take 30 seconds, set the timeout to 60 seconds. A consumer can call ChangeMessageVisibility to extend the timeout dynamically if processing takes longer than anticipated.
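The extend-then-delete flow can be sketched against the real ChangeMessageVisibility and DeleteMessage actions; the client is injected so the sketch stays generic, and the queue URL and task are placeholders:

```python
def process_long_task(sqs, queue_url: str, receipt_handle: str, task) -> None:
    """Run `task`, extending the visibility timeout first and deleting on success."""
    sqs.change_message_visibility(
        QueueUrl=queue_url,
        ReceiptHandle=receipt_handle,
        VisibilityTimeout=120,  # longer than the worst-case processing time
    )
    task()  # the actual work; an exception here skips the delete,
            # so the message reappears after the timeout expires
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=receipt_handle)
```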
Dead-Letter Queue (DLQ)
A DLQ is a separate SQS queue that receives messages that fail processing repeatedly. Configure a maxReceiveCount on the source queue — if a message is received more than this many times without being deleted, SQS moves it to the DLQ automatically.
| Purpose | Detail |
|---|---|
| Debugging | Inspect failed messages without losing them. Examine why processing failed. |
| Alerting | Set a CloudWatch alarm on the DLQ ApproximateNumberOfMessagesVisible metric. |
| Reprocessing | After fixing the bug, replay messages from DLQ back to the source queue via the console or CLI. |
| Isolation | Failed messages do not block other messages from being processed. |
DLQs apply to both Standard and FIFO queues. A FIFO DLQ must be used with a FIFO source queue.
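Wiring a DLQ is a single queue attribute on the source queue. A sketch of the redrive policy (the ARN is a placeholder; RedrivePolicy, deadLetterTargetArn, and maxReceiveCount are the actual attribute and field names):

```python
import json

def redrive_attributes(dlq_arn: str, max_receives: int = 5) -> dict:
    """Attributes for sqs.set_queue_attributes on the source queue."""
    return {
        "RedrivePolicy": json.dumps({
            "deadLetterTargetArn": dlq_arn,        # where failed messages land
            "maxReceiveCount": str(max_receives),  # receives before giving up
        })
    }

# sqs.set_queue_attributes(
#     QueueUrl=source_queue_url,
#     Attributes=redrive_attributes("arn:aws:sqs:us-east-1:123456789012:orders-dlq"))
```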
Long Polling
By default, ReceiveMessage returns immediately — even if the queue is empty. This short polling generates many empty API responses, increasing cost and CPU overhead on the consumer.
Long polling instructs SQS to wait up to 20 seconds (WaitTimeSeconds) for a message to arrive before returning. If a message arrives during the wait, it is returned immediately. If the timeout expires with no message, an empty response is returned. Long polling reduces empty responses by 90%+ and lowers SQS cost. Enable it on the queue via ReceiveMessageWaitTimeSeconds or per individual request.
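A long-polling receive in one call, sketched with the client injected (the queue URL is hypothetical); WaitTimeSeconds and MaxNumberOfMessages are the real ReceiveMessage parameters:

```python
def receive_batch(sqs, queue_url: str) -> list:
    """One long-poll cycle: waits up to 20 s instead of returning empty immediately."""
    resp = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=10,  # up to 10 messages per call
        WaitTimeSeconds=20,      # maximum long-poll wait
    )
    return resp.get("Messages", [])
```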
Key Limits
| Attribute | Standard Queue | FIFO Queue |
|---|---|---|
| Message retention | 4 days default, up to 14 days | Same |
| Max message size | 256 KB | Same |
| Max long poll wait | 20 seconds | Same |
| Max visibility timeout | 12 hours | Same |
| Inflight messages | 120,000 | 20,000 |
For payloads larger than 256 KB, use the SQS Extended Client Library (Java/Python), which stores the message body in S3 and puts only the S3 reference in the SQS message. The consumer retrieves the reference, fetches from S3, and processes the full payload.
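The claim-check pattern the Extended Client automates can be sketched by hand. The pointer format below is illustrative, not the library's actual wire format; only the idea (body in S3, reference in SQS) matches.

```python
import json

SQS_LIMIT = 256 * 1024  # bytes

def offload_if_large(s3, bucket: str, key: str, body: bytes) -> str:
    """Return the SQS message body: the payload itself, or an S3 pointer to it."""
    if len(body) <= SQS_LIMIT:
        return body.decode()
    s3.put_object(Bucket=bucket, Key=key, Body=body)       # store the real payload
    return json.dumps({"s3Bucket": bucket, "s3Key": key})  # send only the pointer
```

The consumer checks for the pointer shape, fetches the object with get_object, and processes the full payload.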
Amazon SNS — Simple Notification Service
SNS is a fully managed pub/sub messaging service. A publisher sends a message to an SNS topic. SNS immediately delivers the message to all subscribed endpoints simultaneously. The publisher has no knowledge of who the subscribers are or how many there are.
Subscription Types
| Subscription Protocol | Use Case |
|---|---|
| SQS | Fan-out to queue for async processing. Queue absorbs traffic spikes. |
| Lambda | Invoke Lambda function directly. |
| HTTP/HTTPS | POST message to a web endpoint. Requires subscription confirmation. |
| Email / Email-JSON | Human notification. Email-JSON delivers raw JSON payload. |
| SMS | Text message to a phone number. |
| Kinesis Data Firehose | Route events to S3, Redshift, or OpenSearch via Firehose. |
| Mobile Push | Apple Push Notification Service (APNS), Firebase Cloud Messaging (FCM/GCM). |
Message Filtering
Without filtering, every subscriber receives every message published to the topic. Subscription filter policies allow each subscriber to declare which messages it wants. A filter policy is a JSON object specifying attribute conditions.
Example: An e-commerce order topic receives events of type order_placed, payment_failed, and order_shipped. The fulfillment service subscribes with a filter for order_placed only. The billing service subscribes with a filter for payment_failed. Neither service sees the other’s messages.
Filter policies evaluate message attributes — key-value pairs attached to the SNS message. Conditions can match exact values, a list of values, numeric ranges, or the presence or absence of an attribute. A message is delivered to a subscriber only if its attributes match the subscriber’s filter policy.
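The order-topic example above, sketched as a filtered subscription (the ARNs are placeholders; FilterPolicy is the real subscription attribute, passed as a JSON string):

```python
import json

def filtered_subscription(topic_arn: str, queue_arn: str, event_types: list) -> dict:
    """Kwargs for sns.subscribe: only matching event_type attributes are delivered."""
    return {
        "TopicArn": topic_arn,
        "Protocol": "sqs",
        "Endpoint": queue_arn,
        "Attributes": {
            "FilterPolicy": json.dumps({"event_type": event_types}),
        },
    }

# Fulfillment sees only order_placed; billing would subscribe with ["payment_failed"].
# sns.subscribe(**filtered_subscription(topic_arn, fulfillment_queue_arn, ["order_placed"]))
```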
Fan-Out Pattern
The most common SNS + SQS pattern combines both services:
- A single event is published to an SNS topic.
- SNS delivers the event to multiple SQS queues simultaneously.
- Each queue has its own consumer service that processes the event independently.
This achieves true fan-out: multiple downstream systems process the same event without knowing about each other. Each queue provides buffering and retry semantics independently. If one downstream service is slow, its queue absorbs the backlog without affecting others.
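On the publishing side of the fan-out, one call reaches every subscribed queue. A sketch of the publish parameters (the topic ARN and attribute names are illustrative); the message attribute is what subscription filter policies match on:

```python
import json

def publish_params(topic_arn: str, event_type: str, detail: dict) -> dict:
    """Kwargs for sns.publish: one event, delivered to every subscribed queue."""
    return {
        "TopicArn": topic_arn,
        "Message": json.dumps(detail),
        "MessageAttributes": {  # attributes are what filter policies evaluate
            "event_type": {"DataType": "String", "StringValue": event_type},
        },
    }

# sns.publish(**publish_params(topic_arn, "order_placed", {"order_id": "o-42"}))
```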
FIFO SNS Topics
SNS FIFO topics provide ordered, deduplicated delivery — but they can only deliver to SQS FIFO queues. Use SNS FIFO when you need fan-out with strict ordering. The ordering and deduplication guarantees propagate from the SNS topic through to the SQS FIFO queues.
Amazon Kinesis
Kinesis is a platform for real-time data streaming at scale. Where SQS and SNS are message-oriented — individual records consumed and deleted — Kinesis is stream-oriented: records are retained for a configurable period, and multiple independent consumers can read the same data independently.
Kinesis Data Streams
Data producers (applications, IoT devices, log agents, clickstream collectors) write records to a Kinesis Data Stream. The stream is divided into shards — the unit of capacity.
Each shard supports:
- 1 MB/s or 1,000 records per second ingest, whichever limit is reached first
- 2 MB/s egress per consumer
Scale by adding shards. A stream with 10 shards handles 10 MB/s ingest. Records are ordered within a shard. A partition key determines which shard receives a record — records with the same partition key always land on the same shard. Use a high-cardinality partition key (user ID, session ID, device ID) to distribute load evenly across shards.
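A sketch of a single-record put (the stream name and event fields are illustrative); PartitionKey is the real parameter that pins a record to a shard:

```python
import json

def put_record_params(stream_name: str, user_id: str, event: dict) -> dict:
    """Kwargs for kinesis.put_record; the same user_id always hashes to the same shard."""
    return {
        "StreamName": stream_name,
        "Data": json.dumps(event).encode(),  # Kinesis records are opaque bytes
        "PartitionKey": user_id,             # high-cardinality key spreads load
    }

# kinesis.put_record(**put_record_params("clickstream", "user-8312", {"page": "/cart"}))
```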
Data retention: 24 hours by default, extendable to up to 365 days. During the retention window, any consumer can read any record, replay from any position, or read at multiple speeds independently.
Shard Consumers: Classic vs Enhanced Fan-Out
| Mode | Description | Throughput |
|---|---|---|
| Classic (GetRecords) | Consumer polls the shard. The 2 MB/s egress limit is shared across all consumers of that shard. Five consumers on one shard each get approximately 400 KB/s. | Shared 2 MB/s per shard |
| Enhanced Fan-Out | Each registered consumer receives a dedicated 2 MB/s throughput per shard. Data is pushed via HTTP/2. Eliminates consumer competition for bandwidth. | Dedicated 2 MB/s per consumer per shard |
Enhanced Fan-Out is recommended when multiple independent services consume the same stream — for example, a real-time analytics pipeline, a monitoring pipeline, and an archival pipeline all reading the same clickstream simultaneously at full speed.
Kinesis Data Firehose
Firehose is a fully managed delivery service that reads from Kinesis Data Streams (or directly from producers) and delivers data to storage and analytics destinations:
- Amazon S3: Buffer by size (1–128 MB) and time (60–900 seconds), then write as objects. Supports Snappy, GZIP, and ZIP compression.
- Amazon Redshift: Deliver to S3 first, then issue a Redshift COPY command automatically.
- Amazon OpenSearch Service: Index documents in real time.
- Splunk, Datadog, MongoDB, HTTP endpoints: Third-party SaaS delivery.
Firehose handles batching, compression, error handling, and retry automatically. No consumer code required. Configure the destination and buffer settings and Firehose does the rest.
Lambda transformation: Attach a Lambda function to a Firehose delivery stream. Firehose passes batches of records to Lambda before delivery. Lambda can filter out unwanted records, parse formats, enrich records by calling external APIs, or convert JSON to Parquet. Records that Lambda marks as failed go to an S3 error bucket.
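A sketch of a transformation handler. The recordId / result / base64 data shape follows Firehose's data-transformation contract; the filtering rule itself (dropping DEBUG logs) is illustrative.

```python
import base64
import json

def handler(event, context):
    """Firehose transformation Lambda: drop DEBUG records, enrich the rest."""
    out = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))
        if payload.get("level") == "DEBUG":
            out.append({"recordId": record["recordId"], "result": "Dropped"})
            continue
        payload["processed"] = True  # example enrichment
        data = base64.b64encode(json.dumps(payload).encode()).decode()
        out.append({"recordId": record["recordId"], "result": "Ok", "data": data})
    return {"records": out}
```

Records returned with result ProcessingFailed are the ones Firehose routes to the S3 error bucket.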
Kinesis Data Analytics
Run SQL queries or Apache Flink applications against streaming data in real time:
- SQL mode: Write standard SQL against a streaming input. Define tumbling windows (fixed time), sliding windows, or session windows. Output results to Kinesis Data Streams or Firehose.
- Apache Flink mode: Full Flink runtime managed by AWS. Complex stateful processing, joins across streams, exactly-once semantics, and machine learning inference.
Use cases: detect payment anomalies from a transaction stream, compute 60-second rolling averages of IoT sensor readings, join a clickstream with a product catalog stream, filter error events from application logs in real time.
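The windowing semantics are easier to see in code than in prose. This is not Kinesis Data Analytics code, just a plain-Python illustration of a tumbling window: fixed, non-overlapping time buckets, each aggregated independently.

```python
from collections import defaultdict

def tumbling_average(events, window_seconds=60):
    """events: iterable of (timestamp_seconds, value) pairs.
    Returns {window_start: average} over fixed, non-overlapping windows."""
    buckets = defaultdict(list)
    for ts, value in events:
        window_start = ts - (ts % window_seconds)  # floor to the window boundary
        buckets[window_start].append(value)
    return {start: sum(vals) / len(vals) for start, vals in buckets.items()}
```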
SQS vs Kinesis
| Dimension | SQS | Kinesis Data Streams |
|---|---|---|
| Delivery model | Queue — message processed by one consumer, then deleted | Stream — records retained, multiple independent readers |
| Replay | No — DLQ captures failures but no replay of normal messages | Yes — replay from any position within retention window |
| Ordering | Standard: best-effort. FIFO: strict within group | Strict ordering within shard |
| Multiple consumers | No — each message goes to one consumer | Yes — multiple consumers read same records independently |
| Retention | Up to 14 days | Up to 365 days |
| Throughput | Unlimited (Standard) | Bounded by shard count; scale shards to grow |
| Best for | Task queues, job distribution, decoupling | Real-time analytics, event sourcing, multi-consumer pipelines |
Amazon EventBridge
EventBridge is a serverless event bus that routes events from sources to targets based on rules. It is the central nervous system for event-driven architectures on AWS — connecting AWS services, custom applications, and SaaS partners without writing polling or integration code.
Event Buses
| Bus Type | Description |
|---|---|
| Default bus | Receives events from AWS services automatically. EC2 state changes, S3 object events when configured, CloudTrail API calls, CodePipeline stage transitions, and more. |
| Custom bus | Receives events from your application code via the PutEvents API. Create one bus per application domain or microservice boundary. |
| Partner bus | Receives events from SaaS partner services — Datadog, Zendesk, Shopify, Auth0, GitHub, PagerDuty, and others. Partners publish directly to your partner event bus. |
Rules
A rule has two parts: a filter pattern and one or more targets.
Event pattern: A JSON object that matches events by field values. Match on source, detail-type, specific keys within the detail block, account ID, or region. Supports exact match, prefix, suffix, wildcard, numeric ranges, and anything-but negation.
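A sketch of an event pattern (the source name and fields are hypothetical); numeric conditions use the bracketed operator form shown:

```python
import json

# Match custom OrderPlaced events whose total exceeds 100.
ORDER_PATTERN = {
    "source": ["com.example.orders"],                # hypothetical custom source
    "detail-type": ["OrderPlaced"],
    "detail": {"total": [{"numeric": [">", 100]}]},  # numeric range condition
}

# events.put_rule(Name="large-orders", EventBusName="orders",
#                 EventPattern=json.dumps(ORDER_PATTERN))
```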
Targets: Up to five targets per rule. When an event matches the pattern, EventBridge invokes all targets simultaneously:
| Target | Notes |
|---|---|
| Lambda function | Most common. Invoked asynchronously with the full event payload. |
| SQS queue | Queues the event for async processing. Absorbs traffic spikes. |
| SNS topic | Fan-out the event to multiple subscribers. |
| Step Functions state machine | Start a workflow execution with the event as input. |
| ECS task | Run a container task in response to the event. |
| API Gateway | HTTP POST to a REST or HTTP API endpoint. |
| Kinesis Data Stream or Firehose | Route events into a streaming pipeline. |
| Another EventBridge bus | Forward events across accounts or organizational units. |
Input transformation: Before delivering to a target, EventBridge can transform the event payload. Extract specific fields, rename keys, add static values, or construct a new JSON document. This avoids the need for a Lambda adapter function just to reshape an event before passing it to a target.
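A sketch of an input transformer attached to a target (the field names are illustrative): InputPathsMap pulls values out of the event, and InputTemplate assembles the reshaped payload.

```python
def reshaped_target(target_id: str, lambda_arn: str) -> dict:
    """A put_targets entry that delivers a trimmed payload instead of the raw event."""
    return {
        "Id": target_id,
        "Arn": lambda_arn,
        "InputTransformer": {
            "InputPathsMap": {  # extract fields from the matched event
                "id": "$.detail.orderId",
                "total": "$.detail.total",
            },
            "InputTemplate": '{"order": "<id>", "amount": <total>}',
        },
    }

# events.put_targets(Rule="large-orders", Targets=[reshaped_target("t1", lambda_arn)])
```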
Scheduled Rules
EventBridge supports scheduled invocation using:
- Rate expressions: rate(5 minutes), rate(1 hour), rate(1 day)
- Cron expressions: cron(0 12 * * ? *) — every day at noon UTC
Scheduled rules replace traditional cron jobs on servers. Invoke a Lambda function to clean up stale records, trigger a Step Functions workflow for nightly batch processing, or run an ECS task to generate reports. No server or cron daemon required.
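A sketch of a scheduled rule replacing a nightly cron job (the rule name and target are placeholders); ScheduleExpression accepts either a rate or a cron expression:

```python
def nightly_rule() -> dict:
    """Kwargs for events.put_rule; attach the Lambda afterwards with put_targets."""
    return {
        "Name": "nightly-cleanup",                  # hypothetical rule name
        "ScheduleExpression": "cron(0 3 * * ? *)",  # 03:00 UTC every day
    }

# events.put_rule(**nightly_rule())
# events.put_targets(Rule="nightly-cleanup",
#                    Targets=[{"Id": "cleanup", "Arn": cleanup_lambda_arn}])
```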
Archive and Replay
EventBridge can archive events flowing through a bus — all events, or a filtered subset matching a pattern. Archives are stored in S3-backed storage managed by EventBridge. Configure an optional retention period.
Replay: Replay archived events back through the bus at any time. All rules evaluate replayed events as if they were new and invoke targets accordingly. Use replay to debug a new rule by testing it against historical events, reprocess events after a consumer bug is fixed, or populate a new downstream system with historical data.
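A sketch of kicking off a replay (the ARNs and time window are placeholders); EventSourceArn points at the archive, and Destination names the bus the events re-enter:

```python
from datetime import datetime, timezone

def replay_params(archive_arn: str, bus_arn: str,
                  start: datetime, end: datetime) -> dict:
    """Kwargs for events.start_replay: push archived events back through the bus."""
    return {
        "ReplayName": "reprocess-after-fix",  # hypothetical replay name
        "EventSourceArn": archive_arn,        # the archive, not the live bus
        "EventStartTime": start,
        "EventEndTime": end,
        "Destination": {"Arn": bus_arn},      # rules on this bus re-evaluate
    }

# events.start_replay(**replay_params(
#     archive_arn, bus_arn,
#     datetime(2024, 1, 1, tzinfo=timezone.utc),
#     datetime(2024, 1, 2, tzinfo=timezone.utc)))
```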
Schema Registry
As events flow through EventBridge, the schema registry discovers and records their structure automatically. Schemas describe the detail field — field names, types, and nested objects.
From discovered schemas, EventBridge generates code bindings: strongly-typed classes and deserialization code for TypeScript, Python, Java, and Go. Your Lambda or application code receives a typed object instead of a raw JSON map, with IDE autocompletion for event fields.
Comparing the Services
| Scenario | Recommended Service |
|---|---|
| Decouple two services, one consumer per message | SQS Standard Queue |
| Ordered processing, exactly-once, transactional | SQS FIFO Queue |
| Broadcast one event to multiple services simultaneously | SNS Topic |
| Fan-out with per-subscriber message filtering | SNS with filter policies |
| Real-time stream, multiple independent consumers, replay | Kinesis Data Streams |
| Deliver stream data to S3 or Redshift, no consumer code | Kinesis Data Firehose |
| Real-time stream SQL or Flink analytics | Kinesis Data Analytics |
| React to AWS service events, route to multiple targets | EventBridge |
| Schedule periodic invocations, replace cron jobs | EventBridge Scheduled Rules |
| Replay historical events for debugging or reprocessing | EventBridge Archive and Replay |
| Orchestrate multi-step workflows with retries and branches | Step Functions |
Fan-Out Pattern: File Upload Pipeline
A common real-world composition: a user uploads a file to S3. Multiple downstream systems must react — thumbnail generation, metadata extraction, and a notification service. Each system is independent and scales separately. S3 publishes the ObjectCreated event to an SNS topic; the topic fans out to two SQS queues (Queue A for the thumbnail worker, Queue B for the metadata extractor) and invokes a notification Lambda directly.
Each service processes the same upload event independently. Queue A and Queue B provide buffering — if the thumbnail worker is temporarily overwhelmed, messages accumulate in the queue and are processed when capacity returns, without affecting the metadata extractor or the notification Lambda. If either consumer fails repeatedly, messages move to a DLQ for inspection without blocking the others.