Overview
The AWS Well-Architected Framework is a structured set of architectural best practices organized around six pillars. Each pillar addresses a distinct concern — operations, security, reliability, performance, cost, and environmental impact — and provides design principles, specific questions to evaluate your architecture against, and recommended AWS services and patterns.
The Framework is not a prescriptive blueprint. It is a checklist for evaluating trade-offs. No architecture perfectly satisfies all six pillars simultaneously. The process of working through the Framework surfaces gaps, prioritizes remediation, and forces explicit decisions about which trade-offs are acceptable for a given system.
AWS provides the Well-Architected Tool in the console, which codifies the Framework into a structured review process. This article covers the substance of each pillar and the key design decisions each one drives.
Pillar 1 — Operational Excellence
Operational excellence is the ability to run and monitor systems to deliver business value and to continuously improve processes and procedures. It is not a destination — it is a practice that evolves as the system and the team evolve.
Design Principles
- Perform operations as code. Define infrastructure and procedures in code. Use CloudFormation, CDK, or Terraform to provision infrastructure. Use AWS Systems Manager documents for operational runbooks. Code can be version-controlled, reviewed, and tested; manual procedures cannot.
- Make frequent, small, reversible changes. Small deployments fail smaller, are easier to diagnose, and can be rolled back quickly. Blue/green deployments, canary releases, and feature flags all enable small reversible changes.
- Refine operations procedures frequently. Game days, chaos engineering experiments, and post-incident reviews keep procedures sharp. Procedures that are never tested will fail at the worst moment.
- Anticipate failure. Identify potential sources of failure before they occur. Use pre-mortems. Design for failure at every layer rather than assuming components will be available.
- Learn from all operational events and failures. Every incident produces a post-incident review. Findings drive backlog items. The same failure should not happen twice.
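The "operations as code" principle can be made concrete by defining a runbook as an SSM Automation document and committing it to version control alongside application code. The sketch below builds a minimal document as a Python dict; the document contents (service name, commands) are illustrative assumptions, not a prescribed runbook.

```python
import json

# Minimal sketch of a runbook as code: an SSM Automation document that
# restarts a hypothetical application service on a target instance.
# The service name and commands are illustrative assumptions.
restart_runbook = {
    "schemaVersion": "0.3",
    "description": "Restart the app service on a target EC2 instance",
    "parameters": {
        "InstanceId": {"type": "String", "description": "Target EC2 instance"}
    },
    "mainSteps": [
        {
            "name": "restartService",
            "action": "aws:runCommand",
            "inputs": {
                "DocumentName": "AWS-RunShellScript",
                "InstanceIds": ["{{ InstanceId }}"],
                "Parameters": {"commands": ["sudo systemctl restart app"]},
            },
        }
    ],
}

# The serialized JSON is what gets committed, reviewed, and registered
# (e.g. via ssm.create_document), so the runbook is versioned and tested
# like any other code artifact.
document_body = json.dumps(restart_runbook, indent=2)
```

Because the document is data, it can be linted and diffed in code review — exactly the properties manual procedures lack.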
Key Services and Tools
| Service | Role in Operational Excellence |
|---|---|
| AWS CloudFormation | Infrastructure as code. Define stacks in JSON or YAML. Stack updates are controlled, reviewable, and reversible. Drift detection identifies out-of-band changes. |
| AWS CDK | Infrastructure as code using TypeScript, Python, Java, or Go. Generates CloudFormation. Higher abstraction and better IDE support than raw CloudFormation. |
| Amazon CloudWatch | Metrics, logs, alarms, and dashboards. The observability backbone. CloudWatch Alarms trigger Auto Scaling, SNS notifications, and Systems Manager automation. |
| AWS X-Ray | Distributed tracing. Visualizes request flows through microservices. Identifies latency bottlenecks and errors across service boundaries. |
| AWS Systems Manager | Operational toolbox: run commands, patch management (Patch Manager), session management (Session Manager — shell access without SSH keys or open inbound ports), parameter store for configuration, and automation documents. |
| AWS Config | Records configuration history of AWS resources. Config Rules flag non-compliant configurations. Remediates drift automatically or flags it for review. |
Example Design Decisions
- Use CloudFormation nested stacks to manage large infrastructure, not a single monolithic template
- Implement CI/CD pipelines with CodePipeline + CodeBuild that include integration tests before deploying to production
- Publish a standard set of CloudWatch dashboards with SLO-aligned metrics for every production service
- Define runbooks as SSM automation documents so on-call engineers execute the same steps every time
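The SLO-aligned dashboard decision above implies derived metrics, not just raw counters. The sketch below computes availability and error-budget burn from request counts — the kind of calculation a dashboard widget or alarm expression would encode. The 99.9% target is an illustrative assumption.

```python
# Sketch: availability SLO and error-budget burn computed from raw
# request counts. The 99.9% target is an illustrative assumption.
SLO_TARGET = 0.999

def availability(total_requests: int, failed_requests: int) -> float:
    """Fraction of successful requests in the measurement window."""
    if total_requests == 0:
        return 1.0
    return 1 - failed_requests / total_requests

def error_budget_consumed(total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget spent; > 1.0 means the SLO is blown."""
    allowed_failures = (1 - SLO_TARGET) * total_requests
    return failed_requests / allowed_failures if allowed_failures else 0.0

# 1M requests with 300 failures: ~99.97% available, ~30% of budget spent.
window_availability = availability(1_000_000, 300)
budget_burn = error_budget_consumed(1_000_000, 300)
```

Alarming on budget burn rather than raw error count keeps paging thresholds tied to the SLO instead of to traffic volume.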
Pillar 2 — Security
Security encompasses identity and access management, detection, infrastructure protection, data protection, and incident response. Security must be addressed at every layer, not as a perimeter around the outside.
Design Principles
- Implement a strong identity foundation. Use IAM with least privilege. No long-term credentials for services — use roles. Enable MFA universally. Centralize identity management via IAM Identity Center.
- Enable traceability. CloudTrail logs every API call. VPC Flow Logs capture network traffic. Enable GuardDuty for threat detection. Log everything; you cannot investigate what you did not record.
- Apply security at all layers. Do not rely on network perimeter alone. Apply security groups at the instance level, NACLs at the subnet level, WAF at the application level, KMS encryption at the data level, and IAM at the API level. Defense in depth means an attacker who breaches one layer still faces another.
- Automate security best practices. Use Config Rules to detect and remediate non-compliant configurations automatically. Use AWS Security Hub to aggregate findings across accounts. Automate patching with Systems Manager Patch Manager.
- Protect data in transit and at rest. All data at rest should be encrypted using KMS or SSE. All traffic should use TLS. Enforce HTTPS on ALBs with redirect rules. Use ACM for certificate management.
- Keep people away from data. Build automated access patterns (SSM Session Manager instead of SSH, Systems Manager Run Command instead of manual access). Limit direct human access to production data stores.
- Prepare for security events. Run incident response exercises. Define and test your runbooks for common scenarios (compromised access key, S3 bucket exposed publicly, EC2 instance cryptomining).
Key Services and Tools
| Service | Role in Security |
|---|---|
| AWS KMS | Managed key service. Create and rotate customer managed keys. Enforces key usage policies. Integrated with S3, EBS, RDS, Secrets Manager, and most storage services. |
| AWS ACM | Provision, manage, and auto-renew TLS certificates for ALBs, CloudFront, API Gateway, and custom domains. Free for AWS-integrated resources. |
| Amazon GuardDuty | Continuous threat detection using ML on CloudTrail, VPC Flow Logs, and DNS logs. Identifies compromised instances, unauthorized API calls, unusual network patterns, and cryptocurrency mining. |
| AWS Security Hub | Aggregates findings from GuardDuty, Inspector, Macie, Firewall Manager, and third-party tools. Provides compliance posture scoring against CIS AWS Foundations Benchmark, PCI DSS, and NIST standards. |
| AWS WAF | Web Application Firewall at the ALB, CloudFront, or API Gateway layer. Blocks SQL injection, XSS, bad bots, and other OWASP Top 10 attack patterns via custom rules or AWS Managed Rules, which provide instant coverage for known threats. |
| Amazon Inspector | Automated vulnerability scanning for EC2 instances (OS packages and CVEs), ECR container images, and Lambda functions. Continuously reassesses as new vulnerabilities are published. |
| Amazon Macie | ML-based sensitive data discovery in S3. Identifies PII, financial data, credentials. Generates findings for unencrypted or publicly accessible buckets containing sensitive data. |
Example Design Decisions
- Require MFA for sensitive calls such as `s3:DeleteBucketPolicy`, `s3:PutBucketPublicAccessBlock`, and `ec2:AuthorizeSecurityGroupIngress` via an IAM condition on `aws:MultiFactorAuthPresent`
- Encrypt all EBS volumes and enable account-level default encryption
- Use private subnets with NAT Gateway for all backend services — no public IPs on application servers
- Deploy WAF with AWS Managed Rules (Core Rule Set + Known Bad Inputs) on every public-facing ALB from day one
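The MFA decision above is typically enforced with a Deny statement conditioned on `aws:MultiFactorAuthPresent`. The sketch below builds such a policy as a Python dict; the action list mirrors the decision and is illustrative, not exhaustive.

```python
import json

# Sketch of an IAM policy denying sensitive API calls unless the caller
# authenticated with MFA. BoolIfExists treats a missing condition key
# (e.g. some service roles) the same as "false". Action list is illustrative.
deny_without_mfa = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenySensitiveCallsWithoutMFA",
            "Effect": "Deny",
            "Action": [
                "s3:DeleteBucketPolicy",
                "s3:PutBucketPublicAccessBlock",
                "ec2:AuthorizeSecurityGroupIngress",
            ],
            "Resource": "*",
            "Condition": {
                "BoolIfExists": {"aws:MultiFactorAuthPresent": "false"}
            },
        }
    ],
}

policy_json = json.dumps(deny_without_mfa, indent=2)
```

An explicit Deny wins over any Allow, so this statement can live in a permissions boundary or SCP without auditing every Allow policy in the account.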
Pillar 3 — Reliability
Reliability is the ability of a workload to perform its intended function correctly and consistently when expected to. It encompasses recovery from infrastructure or service disruptions, the ability to scale to meet demand, and the ability to mitigate disruptions.
Design Principles
- Automatically recover from failure. Design systems to detect failures and recover without human intervention. CloudWatch Alarms triggering EC2 instance recovery, Auto Scaling replacing unhealthy instances, RDS Multi-AZ failing over automatically — all of these eliminate the need for an on-call engineer to execute a recovery runbook at 3am.
- Test recovery procedures. Regularly test failovers, restore from backups, and simulate AZ failures. Recovery procedures that have never been tested will fail during an actual incident.
- Scale horizontally. Replace one large resource with many small resources. Scaling horizontally means a single failure takes out a fraction of capacity rather than the whole system.
- Stop guessing capacity. Use Auto Scaling to match supply to demand. Both over-provisioning (wasted cost) and under-provisioning (service degradation) are failures of capacity planning.
- Manage change through automation. Changes to infrastructure made outside of automation introduce inconsistency and risk. Runbooks that change configuration manually are one fat-finger away from an outage.
Key Services and Tools
| Service | Role in Reliability |
|---|---|
| EC2 Auto Scaling | Replaces failed instances, scales out under load, and scales in during low demand. Works with Launch Templates and Target Tracking policies for automated capacity management. |
| Multi-AZ deployments | RDS Multi-AZ, ElastiCache Multi-AZ, OpenSearch Multi-AZ provide synchronous standby replicas that fail over automatically in the event of an AZ failure. |
| Amazon Route 53 | Health checks monitor endpoints and trigger DNS failover. Weighted, latency-based, failover, and geolocation routing policies provide flexible traffic management. |
| AWS Backup | Centralized backup service covering EBS, RDS, DynamoDB, EFS, FSx, and EC2. Define backup plans with retention policies. Cross-region and cross-account backup copies for disaster recovery. |
| Elastic Load Balancing | Distributes traffic across healthy targets in multiple AZs. Connection draining gracefully handles instance deregistration. |
| Amazon SQS | Decouples producers from consumers. Messages are durably stored (up to 14 days) so a consumer failure does not lose data. Dead-letter queues capture unprocessable messages for investigation. |
Example Design Decisions
- Deploy application tier across three AZs behind an ALB; use RDS Multi-AZ with a read replica in a third AZ for additional read scaling
- Define an RTO of 30 minutes and RPO of 5 minutes for the order processing service; implement continuous backups with point-in-time recovery via AWS Backup to meet the RPO, plus a tested restore runbook
- Use SQS between order intake and fulfillment services so a fulfillment service outage does not cause order loss
- Run quarterly disaster recovery exercises that test full restoration from backup in a separate AWS account
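The SQS dead-letter-queue behavior referenced above can be sketched without AWS at all: a message that fails processing `maxReceiveCount` times is moved to the DLQ instead of being retried forever or lost. The queue names and the limit of 3 below are illustrative assumptions.

```python
from collections import defaultdict

# In-memory simulation of SQS redelivery with a dead-letter queue.
# maxReceiveCount = 3 is an illustrative assumption.
MAX_RECEIVE_COUNT = 3

def drain(queue: list, process, dlq: list) -> None:
    """Deliver each message until processed or exhausted, then dead-letter it."""
    receives = defaultdict(int)
    while queue:
        msg = queue.pop(0)
        receives[msg] += 1
        try:
            process(msg)                  # consumer handler
        except Exception:
            if receives[msg] >= MAX_RECEIVE_COUNT:
                dlq.append(msg)           # captured for investigation
            else:
                queue.append(msg)         # redelivered after visibility timeout

def handler(msg: str) -> None:
    if msg == "poison":
        raise ValueError("unprocessable message")

dead_letters: list = []
drain(["order-1", "poison", "order-2"], handler, dead_letters)
# dead_letters == ["poison"]; the two good orders were processed normally.
```

The key reliability property is that the poison message neither blocks the queue nor disappears: it lands somewhere an engineer can inspect it.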
Pillar 4 — Performance Efficiency
Performance efficiency is the ability to use computing resources efficiently to meet system requirements and to maintain that efficiency as demand changes and technologies evolve.
Design Principles
- Democratize advanced technologies. Use managed services for things that are hard to run yourself — NoSQL at scale (DynamoDB), in-memory caching (ElastiCache), managed streaming (Kinesis), ML inference (SageMaker endpoints). Spending engineering time managing Elasticsearch clusters is not differentiated work.
- Go global in minutes. Deploying to multiple regions takes far less effort in the cloud than in traditional datacenters. CloudFormation StackSets, CodePipeline cross-region actions, and Route 53 geolocation routing enable multi-region architectures with modest additional complexity.
- Use serverless architectures. Lambda, Fargate, DynamoDB On-Demand — serverless removes capacity management entirely. Focus engineering effort on application logic.
- Experiment more often. Cloud infrastructure enables rapid A/B testing of architecture decisions. Run a compute-optimized instance type versus a memory-optimized one, measure the difference, and decide. The cost of experimentation is low.
- Consider mechanical sympathy. Match the right tool to the workload. OLTP → Aurora or RDS. OLAP → Redshift. Time-series → Timestream. Key-value access patterns → DynamoDB. Forcing relational workloads onto NoSQL, or key-value workloads onto relational databases, produces poor performance and high cost.
Key Services and Tools
| Service | Role in Performance Efficiency |
|---|---|
| Amazon CloudFront | CDN with a global edge network. Serves content from edge locations near end users, sharply reducing latency regardless of origin region. Especially effective for static assets, SPAs, and cacheable API responses. |
| Amazon ElastiCache | Managed in-memory caching. Redis (ElastiCache for Redis) supports data structures, pub/sub, Lua scripting, and persistence. Memcached is simpler and multi-threaded. Reduces database load for read-heavy workloads. |
| Amazon Aurora | MySQL and PostgreSQL-compatible relational database with up to 5× MySQL throughput and 3× PostgreSQL throughput. Distributed storage across 6 copies in 3 AZs. Aurora Serverless v2 auto-scales in fine-grained increments. |
| AWS Lambda | Serverless compute. Instant scaling from zero to thousands of concurrent executions. Eliminates idle capacity cost. Compute time billed in 1ms increments. |
| AWS Graviton | ARM-based processors designed by AWS. Available for EC2 (M7g, C7g, R7g families), Lambda, RDS, ElastiCache, and others. 20–40% better price/performance than equivalent x86 instances for most workloads. |
Example Design Decisions
- Cache database read results in ElastiCache for Redis with a 60-second TTL; target cache hit rate >80% for the product catalog service
- Migrate from gp2 to gp3 EBS volumes to get a 3,000 IOPS baseline independent of volume size, at a lower per-GB price than gp2
- Run batch processing workloads on C7g (Graviton3) instances for 25% cost reduction versus C6i with no code changes
- Use Aurora Auto Scaling with reader endpoints to automatically add read replicas during peak traffic
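The cache-aside pattern behind the ElastiCache decision above is worth seeing in miniature. The sketch below simulates it with an in-process dict and an injected clock so the 60-second TTL is testable without sleeping; the class and key names are illustrative.

```python
import itertools

# Sketch of cache-aside with a TTL, simulated in-process. The injected
# clock stands in for wall time; names are illustrative assumptions.
TTL_SECONDS = 60

class CacheAside:
    def __init__(self, loader, clock):
        self.loader = loader        # falls through to the database on a miss
        self.clock = clock          # returns current time in seconds
        self.store = {}             # key -> (value, expires_at)
        self.hits = self.misses = 0

    def get(self, key):
        now = self.clock()
        entry = self.store.get(key)
        if entry and entry[1] > now:
            self.hits += 1
            return entry[0]
        self.misses += 1
        value = self.loader(key)    # cache miss: read from the source of truth
        self.store[key] = (value, now + TTL_SECONDS)
        return value

fake_now = itertools.count()        # fake clock: advances one "second" per call
cache = CacheAside(loader=lambda k: f"row:{k}", clock=lambda: next(fake_now))
for _ in range(10):
    cache.get("product-42")
# 10 reads within the TTL: 1 miss (initial load), 9 hits — a 90% hit rate.
```

The same structure maps onto Redis `GET`/`SETEX`: the application owns the fallback to the database, and the TTL bounds staleness.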
Pillar 5 — Cost Optimization
Cost optimization is the ability to run systems that deliver business value at the lowest price point. This is not simply about spending less — it is about understanding where money goes, eliminating waste, and choosing the right consumption model for each workload.
Design Principles
- Implement cloud financial management. Cost optimization requires dedicated attention, tooling, and ownership. Assign cost accountability to engineering teams. Use tagging strategies to attribute costs to teams, products, and environments.
- Adopt a consumption model. Pay only for what you use. Turn off development environments on evenings and weekends using EC2 Instance Scheduler. Use Lambda for variable workloads rather than running EC2 instances 24/7 for traffic that arrives for two hours per day.
- Measure overall efficiency. Track the cost per business unit — cost per order processed, cost per API request, cost per active user. Efficiency targets are more useful than raw spend targets.
- Eliminate undifferentiated heavy lifting. Managed services cost more than self-managed per unit of compute, but they eliminate the engineering time required to operate, patch, and scale the underlying infrastructure. Engineer time is expensive.
- Analyze and attribute expenditure. Use tags consistently across all resources. Use Cost Allocation Tags in the billing console. Implement AWS Cost and Usage Report for granular analysis.
Key Services and Tools
| Service | Role in Cost Optimization |
|---|---|
| AWS Savings Plans | Commit to a consistent hourly spend ($/hr) for 1 or 3 years in exchange for discounts of up to 66%. Compute Savings Plans apply to EC2, Lambda, and Fargate regardless of instance family, size, or region. More flexible than Reserved Instances. |
| Spot Instances | Spare EC2 capacity at up to 90% discount. Interruptible with 2-minute warning. Best for fault-tolerant, stateless, batch workloads. Use Spot Fleet or Auto Scaling Groups with mixed instance policies to maintain capacity across multiple Spot pools. |
| AWS Cost Explorer | Visualize cost and usage over time. Hourly granularity. Filter by service, region, tag, linked account. Cost Explorer also provides Savings Plans and Reserved Instance recommendations based on actual usage history. |
| AWS Trusted Advisor | Checks across cost optimization, performance, security, fault tolerance, and service limits. Cost optimization checks include idle load balancers, underutilized EC2 instances, unassociated Elastic IPs, and unused RDS instances. |
| AWS Compute Optimizer | ML-based recommendations for right-sizing EC2 instances, ECS tasks, Lambda function memory, EBS volumes, and Auto Scaling Groups. Analyzes 14 days of CloudWatch metrics and recommends optimal configurations with projected cost and performance impact. |
Example Design Decisions
- Purchase Compute Savings Plans covering 70% of steady-state baseline compute; run remaining capacity as On-Demand and Spot
- Use Spot Instances for all batch data processing jobs with Auto Scaling Group mixed instance policy (70% Spot, 30% On-Demand) across 6 instance types to minimize interruption risk
- Tag all resources with `Environment`, `Team`, `CostCenter`, and `Application`, enforced via SCP; monthly cost allocation report reviewed by engineering leads
- Implement EC2 Instance Scheduler to shut down development environments outside business hours (saves ~65% of dev EC2 cost)
- Move infrequently accessed S3 data to S3 Intelligent-Tiering, eliminating manual lifecycle rule management while reducing storage cost by 40–60%
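The 70% Savings Plan coverage decision above can be sanity-checked with a back-of-envelope blended-cost model. The On-Demand rate and discount percentages below are illustrative assumptions, not published AWS prices.

```python
# Back-of-envelope model for a mixed purchasing strategy.
# All rates and discounts are illustrative assumptions.
ON_DEMAND_RATE = 1.00    # $/hr for one baseline unit of compute (assumed)
SP_DISCOUNT = 0.40       # Compute Savings Plan discount (assumed)
SPOT_DISCOUNT = 0.70     # Spot discount (assumed)

def blended_hourly_cost(sp_share: float, spot_share: float) -> float:
    """Cost per hour for one baseline unit under the given purchase mix."""
    od_share = 1 - sp_share - spot_share
    return ON_DEMAND_RATE * (
        sp_share * (1 - SP_DISCOUNT)
        + spot_share * (1 - SPOT_DISCOUNT)
        + od_share
    )

# 70% Savings Plans, 20% Spot, 10% On-Demand:
cost = blended_hourly_cost(0.70, 0.20)
# ~0.58: roughly a 42% reduction versus running everything On-Demand.
```

Plugging in your own rates from Cost Explorer turns this into a quick check on whether additional commitment or Spot adoption moves the needle.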
Pillar 6 — Sustainability
Sustainability is the newest pillar, addressing the environmental impact of running cloud workloads. The goal is to minimize the energy consumption and carbon footprint of cloud infrastructure without sacrificing the other five pillars.
Design Principles
- Understand your impact. Use the AWS Customer Carbon Footprint Tool to see your carbon emissions and track them over time as you optimize.
- Establish sustainability goals. Set improvement targets. Measure progress. Sustainability should be a first-class engineering objective, not an afterthought.
- Maximize utilization. Underutilized infrastructure consumes power without delivering proportional value. Right-size everything. Higher utilization per physical server means fewer servers run to do the same work.
- Anticipate and adopt more efficient hardware and software. Migrate to Graviton instances, upgrade to newer generation instance types, and adopt managed services as they become more energy-efficient.
- Use managed services. Managed services are operated at higher utilization than self-managed infrastructure. AWS can run DynamoDB at 80%+ utilization across the global fleet; a self-managed DynamoDB-equivalent would likely run at 20–30% utilization.
- Reduce the downstream impact of your cloud workloads. Efficient code, smaller payloads, appropriate caching, and CDN usage reduce the compute, storage, and network consumed per user interaction.
Key Services and Tools
| Service | Role in Sustainability |
|---|---|
| AWS Graviton | ARM-based processors deliver the same or better performance at up to 60% lower energy use than equivalent x86 instances. Graviton3 is among the most energy-efficient processors AWS offers. |
| AWS Compute Optimizer | Right-sizing recommendations reduce idle compute capacity. Fewer running instances means lower energy consumption. |
| Amazon S3 Intelligent-Tiering | Moves objects to lower-cost, lower-energy storage tiers automatically. Cold data stored at lower energy per byte than hot storage. |
| Managed services broadly | RDS, Aurora, DynamoDB, Lambda, Fargate — all run at higher infrastructure utilization rates than equivalent self-managed deployments, translating to less energy per unit of work. |
| AWS Customer Carbon Footprint Tool | Monthly carbon emissions data by service, region, and account. Available in the AWS Billing console. |
Example Design Decisions
- Migrate all general-purpose EC2 workloads from M6i to M7g (Graviton3) — 40% cost reduction, 60% energy reduction per instance
- Replace self-managed Redis cluster (running at 15% utilization) with ElastiCache for Redis — higher fleet utilization, no idle capacity
- Implement aggressive CloudFront caching to reduce origin requests by 80% — same user experience, 80% less compute executed per request
- Adopt Lambda for event-driven workloads that were previously running on always-on EC2 — zero energy consumed when no requests arrive
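The caching decision above rests on simple arithmetic: origin compute scales with cache misses, so an 80% edge hit rate removes 80% of origin requests. The request volume below is an illustrative assumption.

```python
# Quantifying cache-driven compute reduction. Request volume is an
# illustrative assumption.
def origin_requests(total_requests: int, cache_hit_rate: float) -> int:
    """Requests that still reach (and consume compute at) the origin."""
    return round(total_requests * (1 - cache_hit_rate))

before = origin_requests(10_000_000, 0.0)   # no CDN caching
after = origin_requests(10_000_000, 0.8)    # 80% edge hit rate
# before == 10_000_000, after == 2_000_000: the origin fleet (and its
# energy draw) can shrink accordingly.
```

The same calculation applies to any layer that short-circuits work: ElastiCache in front of a database, or browser caching in front of CloudFront.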
Pillar Summary
| Pillar | Core Question | Key AWS Services | Example Design Decision |
|---|---|---|---|
| Operational Excellence | Can we run and improve this system sustainably? | CloudFormation, CloudWatch, X-Ray, SSM, Config | All infrastructure defined as CDK code; runbooks as SSM documents |
| Security | Is the data and system protected at every layer? | IAM, KMS, GuardDuty, Security Hub, WAF, Inspector | GuardDuty + Security Hub enabled in all regions; WAF on all public ALBs |
| Reliability | Will it recover from failures without human intervention? | Auto Scaling, Multi-AZ, Route 53, AWS Backup, SQS | Three-AZ deployment with automatic EC2 and RDS failover |
| Performance Efficiency | Are we using the right tools at the right size? | CloudFront, ElastiCache, Aurora, Lambda, Graviton | Migrate batch jobs to Graviton3; cache read-heavy data in ElastiCache |
| Cost Optimization | Are we spending efficiently and attributing cost correctly? | Savings Plans, Spot, Cost Explorer, Trusted Advisor, Compute Optimizer | 70% Savings Plan coverage; Spot for batch; enforced resource tagging |
| Sustainability | Are we minimizing environmental impact? | Graviton, Compute Optimizer, managed services, Customer Carbon Footprint Tool | All compute migrated to Graviton3; self-managed clusters replaced with managed services |
The AWS Well-Architected Tool
The AWS Well-Architected Tool is a free self-service questionnaire in the AWS Management Console that walks through the Framework systematically.
How a review works:
- Define a workload: name it, describe it, select the applicable lenses (the standard Framework, plus specialty lenses for serverless, SaaS, IoT, analytics, ML, financial services, and government).
- Answer the questions for each pillar. Questions take the form “Do you [specific practice]?” and ask which of the listed best practices you follow; questions that do not apply to the workload can be marked as such.
- The tool generates a gap report: best practices not implemented, categorized as High Risk Issues or Medium Risk Issues.
- Improvement plan: recommendations for each gap, linked to documentation and relevant AWS services.
- Track progress over time by saving milestone snapshots of each review.
The Well-Architected Tool is most valuable when used as a recurring practice — not a one-time review. Run it quarterly on critical workloads, and whenever a workload undergoes significant architectural change.
AWS partners can also conduct Well-Architected Reviews as a formal engagement, providing external perspective and access to AWS field team support for addressing findings.