Overview
The AWS Well-Architected Framework is a structured set of architectural best practices organized around six pillars. Each pillar addresses a distinct concern — operations, security, reliability, performance, cost, and environmental impact — and provides design principles, specific questions to evaluate your architecture against, and recommended AWS services and patterns.
The Framework is not a prescriptive blueprint. It is a checklist for evaluating trade-offs. No architecture perfectly satisfies all six pillars simultaneously. The process of working through the Framework surfaces gaps, prioritizes remediation, and forces explicit decisions about which trade-offs are acceptable for a given system.
AWS provides the Well-Architected Tool in the console, which codifies the Framework into a structured review process. This article covers the substance of each pillar and the key design decisions each one drives.
Pillar 1 — Operational Excellence
Operational excellence is the ability to run and monitor systems to deliver business value and to continuously improve processes and procedures. It is not a destination — it is a practice that evolves as the system and the team evolve.
Design Principles
- Perform operations as code. Define infrastructure and procedures in code. Use CloudFormation, CDK, or Terraform to provision infrastructure. Use AWS Systems Manager documents for operational runbooks. Code can be version-controlled, reviewed, and tested; manual procedures cannot.
- Make frequent, small, reversible changes. Small deployments fail smaller, are easier to diagnose, and can be rolled back quickly. Blue/green deployments, canary releases, and feature flags all enable small reversible changes.
- Refine operations procedures frequently. Game days, chaos engineering experiments, and post-incident reviews keep procedures sharp. Procedures that are never tested will fail at the worst moment.
- Anticipate failure. Identify potential sources of failure before they occur. Use pre-mortems. Design for failure at every layer rather than assuming components will be available.
- Learn from all operational events and failures. Every incident produces a post-incident review. Findings drive backlog items. The same failure should not happen twice.
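The "operations as code" principle can be made concrete by defining a runbook as an SSM Automation document and committing it to version control alongside application code. The sketch below builds a minimal document as a Python dict; the document contents (service name, commands) are illustrative assumptions, not a prescribed runbook.

```python
import json

# Minimal sketch of a runbook as code: an SSM Automation document that
# restarts a hypothetical application service on a target instance.
# The service name and commands are illustrative assumptions.
restart_runbook = {
    "schemaVersion": "0.3",
    "description": "Restart the app service on a target EC2 instance",
    "parameters": {
        "InstanceId": {"type": "String", "description": "Target EC2 instance"}
    },
    "mainSteps": [
        {
            "name": "restartService",
            "action": "aws:runCommand",
            "inputs": {
                "DocumentName": "AWS-RunShellScript",
                "InstanceIds": ["{{ InstanceId }}"],
                "Parameters": {"commands": ["sudo systemctl restart app"]},
            },
        }
    ],
}

# The serialized JSON is what gets committed, reviewed, and registered
# (e.g. via ssm.create_document), so the runbook is versioned and tested
# like any other code artifact.
document_body = json.dumps(restart_runbook, indent=2)
```

Because the document is data, it can be linted and diffed in code review — exactly the properties manual procedures lack.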
Key Services and Tools
| Service | Role in Operational Excellence |
|---|---|
| AWS CloudFormation | Infrastructure as code. Define stacks in JSON or YAML. Stack updates are controlled, reviewable, and reversible. Drift detection identifies out-of-band changes. |
| AWS CDK | Infrastructure as code using TypeScript, Python, Java, or Go. Generates CloudFormation. Higher abstraction and better IDE support than raw CloudFormation. |
| Amazon CloudWatch | Metrics, logs, alarms, and dashboards. The observability backbone. CloudWatch Alarms trigger Auto Scaling, SNS notifications, and Systems Manager automation. |
| AWS X-Ray | Distributed tracing. Visualizes request flows through microservices. Identifies latency bottlenecks and errors across service boundaries. |
| AWS Systems Manager | Operational toolbox: run commands, patch management (Patch Manager), session management (Session Manager — shell access without SSH keys or open inbound ports), parameter store for configuration, and automation documents. |
| AWS Config | Records configuration history of AWS resources. Config Rules flag non-compliant configurations. Remediates drift automatically or flags it for review. |
Example Design Decisions
- Use CloudFormation nested stacks to manage large infrastructure, not a single monolithic template
- Implement CI/CD pipelines with CodePipeline + CodeBuild that include integration tests before deploying to production
- Publish a standard set of CloudWatch dashboards with SLO-aligned metrics for every production service
- Define runbooks as SSM automation documents so on-call engineers execute the same steps every time
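The SLO-aligned dashboard decision above implies derived metrics, not just raw counters. The sketch below computes availability and error-budget burn from request counts — the kind of calculation a dashboard widget or alarm expression would encode. The 99.9% target is an illustrative assumption.

```python
# Sketch: availability SLO and error-budget burn computed from raw
# request counts. The 99.9% target is an illustrative assumption.
SLO_TARGET = 0.999

def availability(total_requests: int, failed_requests: int) -> float:
    """Fraction of successful requests in the measurement window."""
    if total_requests == 0:
        return 1.0
    return 1 - failed_requests / total_requests

def error_budget_consumed(total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget spent; > 1.0 means the SLO is blown."""
    allowed_failures = (1 - SLO_TARGET) * total_requests
    return failed_requests / allowed_failures if allowed_failures else 0.0

# 1M requests with 300 failures: ~99.97% available, ~30% of budget spent.
window_availability = availability(1_000_000, 300)
budget_burn = error_budget_consumed(1_000_000, 300)
```

Alarming on budget burn rather than raw error count keeps paging thresholds tied to the SLO instead of to traffic volume.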
Pillar 2 — Security
Security encompasses identity and access management, detection, infrastructure protection, data protection, and incident response. Security must be addressed at every layer, not as a perimeter around the outside.
Design Principles
- Implement a strong identity foundation. Use IAM with least privilege. No long-term credentials for services — use roles. Enable MFA universally. Centralize identity management via IAM Identity Center.
- Enable traceability. CloudTrail logs every API call. VPC Flow Logs capture network traffic. Enable GuardDuty for threat detection. Log everything; you cannot investigate what you did not record.
- Apply security at all layers. Do not rely on network perimeter alone. Apply security groups at the instance level, NACLs at the subnet level, WAF at the application level, KMS encryption at the data level, and IAM at the API level. Defense in depth means an attacker who breaches one layer still faces another.
- Automate security best practices. Use Config Rules to detect and remediate non-compliant configurations automatically. Use AWS Security Hub to aggregate findings across accounts. Automate patching with Systems Manager Patch Manager.
- Protect data in transit and at rest. All data at rest should be encrypted using KMS or SSE. All traffic should use TLS. Enforce HTTPS on ALBs with redirect rules. Use ACM for certificate management.
- Keep people away from data. Build automated access patterns (SSM Session Manager instead of SSH, Systems Manager Run Command instead of manual access). Limit direct human access to production data stores.
- Prepare for security events. Run incident response exercises. Define and test your runbooks for common scenarios (compromised access key, S3 bucket exposed publicly, EC2 instance cryptomining).
Key Services and Tools
| Service | Role in Security |
|---|---|
| AWS KMS | Managed key service. Create and rotate customer managed keys. Enforces key usage policies. Integrated with S3, EBS, RDS, Secrets Manager, and most storage services. |
| AWS ACM | Provision, manage, and auto-renew TLS certificates for ALBs, CloudFront, API Gateway, and custom domains. Free for AWS-integrated resources. |
| Amazon GuardDuty | Continuous threat detection using ML on CloudTrail, VPC Flow Logs, and DNS logs. Identifies compromised instances, unauthorized API calls, unusual network patterns, and cryptocurrency mining. |
| AWS Security Hub | Aggregates findings from GuardDuty, Inspector, Macie, Firewall Manager, and third-party tools. Provides compliance posture scoring against CIS AWS Foundations Benchmark, PCI DSS, and NIST standards. |
| AWS WAF | Web Application Firewall at the ALB, CloudFront, or API Gateway layer. Blocks SQL injection, XSS, bad bots, and other OWASP Top 10 attack patterns via custom rules or AWS Managed Rules, which provide instant coverage for known threats. |
| Amazon Inspector | Automated vulnerability scanning for EC2 instances (OS packages and CVEs), ECR container images, and Lambda functions. Continuously reassesses as new vulnerabilities are published. |
| Amazon Macie | ML-based sensitive data discovery in S3. Identifies PII, financial data, credentials. Generates findings for unencrypted or publicly accessible buckets containing sensitive data. |
Example Design Decisions
- Require MFA for sensitive calls such as `s3:DeleteBucketPolicy`, `s3:PutBucketPublicAccessBlock`, and `ec2:AuthorizeSecurityGroupIngress` via an IAM condition on `aws:MultiFactorAuthPresent`
- Encrypt all EBS volumes and enable account-level default encryption
- Use private subnets with NAT Gateway for all backend services — no public IPs on application servers
- Deploy WAF with AWS Managed Rules (Core Rule Set + Known Bad Inputs) on every public-facing ALB from day one
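The MFA decision above is typically enforced with a Deny statement conditioned on `aws:MultiFactorAuthPresent`. The sketch below builds such a policy as a Python dict; the action list mirrors the decision and is illustrative, not exhaustive.

```python
import json

# Sketch of an IAM policy denying sensitive API calls unless the caller
# authenticated with MFA. BoolIfExists treats a missing condition key
# (e.g. some service roles) the same as "false". Action list is illustrative.
deny_without_mfa = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenySensitiveCallsWithoutMFA",
            "Effect": "Deny",
            "Action": [
                "s3:DeleteBucketPolicy",
                "s3:PutBucketPublicAccessBlock",
                "ec2:AuthorizeSecurityGroupIngress",
            ],
            "Resource": "*",
            "Condition": {
                "BoolIfExists": {"aws:MultiFactorAuthPresent": "false"}
            },
        }
    ],
}

policy_json = json.dumps(deny_without_mfa, indent=2)
```

An explicit Deny wins over any Allow, so this statement can live in a permissions boundary or SCP without auditing every Allow policy in the account.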
Pillar 3 — Reliability
Reliability is the ability of a workload to perform its intended function correctly and consistently when expected to. It encompasses recovery from infrastructure or service disruptions, the ability to scale to meet demand, and the ability to mitigate disruptions.
Design Principles
- Automatically recover from failure. Design systems to detect failures and recover without human intervention. CloudWatch Alarms triggering EC2 instance recovery, Auto Scaling replacing unhealthy instances, RDS Multi-AZ failing over automatically — all of these eliminate the need for an on-call engineer to execute a recovery runbook at 3am.
- Test recovery procedures. Regularly test failovers, restore from backups, and simulate AZ failures. Recovery procedures that have never been tested will fail during an actual incident.
- Scale horizontally. Replace one large resource with many small resources. Scaling horizontally means a single failure takes out a fraction of capacity rather than the whole system.
- Stop guessing capacity. Use Auto Scaling to match supply to demand. Both over-provisioning (wasted cost) and under-provisioning (service degradation) are failures of capacity planning.
- Manage change through automation. Changes to infrastructure made outside of automation introduce inconsistency and risk. Runbooks that change configuration manually are one fat-finger away from an outage.
Key Services and Tools
| Service | Role in Reliability |
|---|---|
| EC2 Auto Scaling | Replaces failed instances, scales out under load, and scales in during low demand. Works with Launch Templates and Target Tracking policies for automated capacity management. |
| Multi-AZ deployments | RDS Multi-AZ, ElastiCache Multi-AZ, OpenSearch Multi-AZ provide synchronous standby replicas that fail over automatically in the event of an AZ failure. |
| Amazon Route 53 | Health checks monitor endpoints and trigger DNS failover. Weighted, latency-based, failover, and geolocation routing policies provide flexible traffic management. |
| AWS Backup | Centralized backup service covering EBS, RDS, DynamoDB, EFS, FSx, and EC2. Define backup plans with retention policies. Cross-region and cross-account backup copies for disaster recovery. |
| Elastic Load Balancing | Distributes traffic across healthy targets in multiple AZs. Connection draining gracefully handles instance deregistration. |
| Amazon SQS | Decouples producers from consumers. Messages are durably stored (up to 14 days) so a consumer failure does not lose data. Dead-letter queues capture unprocessable messages for investigation. |
Example Design Decisions
- Deploy application tier across three AZs behind an ALB; use RDS Multi-AZ with a read replica in a third AZ for additional read scaling
- Define an RTO of 30 minutes and RPO of 5 minutes for the order processing service; implement continuous backups with point-in-time recovery via AWS Backup to meet the RPO, plus a tested restore runbook
- Use SQS between order intake and fulfillment services so a fulfillment service outage does not cause order loss
- Run quarterly disaster recovery exercises that test full restoration from backup in a separate AWS account
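The SQS dead-letter-queue behavior referenced above can be sketched without AWS at all: a message that fails processing `maxReceiveCount` times is moved to the DLQ instead of being retried forever or lost. The queue names and the limit of 3 below are illustrative assumptions.

```python
from collections import defaultdict

# In-memory simulation of SQS redelivery with a dead-letter queue.
# maxReceiveCount = 3 is an illustrative assumption.
MAX_RECEIVE_COUNT = 3

def drain(queue: list, process, dlq: list) -> None:
    """Deliver each message until processed or exhausted, then dead-letter it."""
    receives = defaultdict(int)
    while queue:
        msg = queue.pop(0)
        receives[msg] += 1
        try:
            process(msg)                  # consumer handler
        except Exception:
            if receives[msg] >= MAX_RECEIVE_COUNT:
                dlq.append(msg)           # captured for investigation
            else:
                queue.append(msg)         # redelivered after visibility timeout

def handler(msg: str) -> None:
    if msg == "poison":
        raise ValueError("unprocessable message")

dead_letters: list = []
drain(["order-1", "poison", "order-2"], handler, dead_letters)
# dead_letters == ["poison"]; the two good orders were processed normally.
```

The key reliability property is that the poison message neither blocks the queue nor disappears: it lands somewhere an engineer can inspect it.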
Pillar 4 — Performance Efficiency
Performance efficiency is the ability to use computing resources efficiently to meet system requirements and to maintain that efficiency as demand changes and technologies evolve.
Design Principles
- Democratize advanced technologies. Use managed services for things that are hard to run yourself — NoSQL at scale (DynamoDB), in-memory caching (ElastiCache), managed streaming (Kinesis), ML inference (SageMaker endpoints). Spending engineering time managing Elasticsearch clusters is not differentiated work.
- Go global in minutes. Deploying to multiple regions takes far less effort in the cloud than in traditional datacenters. CloudFormation StackSets, CodePipeline cross-region actions, and Route 53 geolocation routing enable multi-region architectures with modest additional complexity.
- Use serverless architectures. Lambda, Fargate, DynamoDB On-Demand — serverless removes capacity management entirely. Focus engineering effort on application logic.
- Experiment more often. Cloud infrastructure enables rapid A/B testing of architecture decisions. Run a compute-optimized instance type versus a memory-optimized one, measure the difference, and decide. The cost of experimentation is low.
- Consider mechanical sympathy. Match the right tool to the workload. OLTP → Aurora or RDS. OLAP → Redshift. Time-series → Timestream. Key-value access patterns → DynamoDB. Forcing relational workloads onto NoSQL, or key-value workloads onto relational databases, produces poor performance and high cost.
Key Services and Tools
| Service | Role in Performance Efficiency |
|---|---|
| Amazon CloudFront | CDN with a global edge network. Serves content from edge locations near end users, sharply reducing latency regardless of origin region. Especially effective for static assets, SPAs, and cacheable API responses. |
| Amazon ElastiCache | Managed in-memory caching. Redis (ElastiCache for Redis) supports data structures, pub/sub, Lua scripting, and persistence. Memcached is simpler and multi-threaded. Reduces database load for read-heavy workloads. |
| Amazon Aurora | MySQL and PostgreSQL-compatible relational database with up to 5× MySQL throughput and 3× PostgreSQL throughput. Distributed storage across 6 copies in 3 AZs. Aurora Serverless v2 auto-scales in fine-grained increments. |
| AWS Lambda | Serverless compute. Instant scaling from zero to thousands of concurrent executions. Eliminates idle capacity cost. Compute time billed in 1ms increments. |
| AWS Graviton | ARM-based processors designed by AWS. Available for EC2 (M7g, C7g, R7g families), Lambda, RDS, ElastiCache, and others. 20–40% better price/performance than equivalent x86 instances for most workloads. |
Example Design Decisions
- Cache database read results in ElastiCache for Redis with a 60-second TTL; target cache hit rate >80% for the product catalog service
- Migrate from gp2 to gp3 EBS volumes to get a 3,000 IOPS baseline independent of volume size, at a lower per-GB price than gp2
- Run batch processing workloads on C7g (Graviton3) instances for 25% cost reduction versus C6i with no code changes
- Use Aurora Auto Scaling with reader endpoints to automatically add read replicas during peak traffic
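The cache-aside pattern behind the ElastiCache decision above is worth seeing in miniature. The sketch below simulates it with an in-process dict and an injected clock so the 60-second TTL is testable without sleeping; the class and key names are illustrative.

```python
import itertools

# Sketch of cache-aside with a TTL, simulated in-process. The injected
# clock stands in for wall time; names are illustrative assumptions.
TTL_SECONDS = 60

class CacheAside:
    def __init__(self, loader, clock):
        self.loader = loader        # falls through to the database on a miss
        self.clock = clock          # returns current time in seconds
        self.store = {}             # key -> (value, expires_at)
        self.hits = self.misses = 0

    def get(self, key):
        now = self.clock()
        entry = self.store.get(key)
        if entry and entry[1] > now:
            self.hits += 1
            return entry[0]
        self.misses += 1
        value = self.loader(key)    # cache miss: read from the source of truth
        self.store[key] = (value, now + TTL_SECONDS)
        return value

fake_now = itertools.count()        # fake clock: advances one "second" per call
cache = CacheAside(loader=lambda k: f"row:{k}", clock=lambda: next(fake_now))
for _ in range(10):
    cache.get("product-42")
# 10 reads within the TTL: 1 miss (initial load), 9 hits — a 90% hit rate.
```

The same structure maps onto Redis `GET`/`SETEX`: the application owns the fallback to the database, and the TTL bounds staleness.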
Pillar 5 — Cost Optimization
Cost optimization is the ability to run systems that deliver business value at the lowest price point. This is not simply about spending less — it is about understanding where money goes, eliminating waste, and choosing the right consumption model for each workload.
Design Principles
- Implement cloud financial management. Cost optimization requires dedicated attention, tooling, and ownership. Assign cost accountability to engineering teams. Use tagging strategies to attribute costs to teams, products, and environments.
- Adopt a consumption model. Pay only for what you use. Turn off development environments on evenings and weekends using EC2 Instance Scheduler. Use Lambda for variable workloads rather than running EC2 instances 24/7 for traffic that arrives for two hours per day.
- Measure overall efficiency. Track the cost per business unit — cost per order processed, cost per API request, cost per active user. Efficiency targets are more useful than raw spend targets.
- Eliminate undifferentiated heavy lifting. Managed services cost more than self-managed per unit of compute, but they eliminate the engineering time required to operate, patch, and scale the underlying infrastructure. Engineer time is expensive.
- Analyze and attribute expenditure. Use tags consistently across all resources. Use Cost Allocation Tags in the billing console. Implement AWS Cost and Usage Report for granular analysis.
Key Services and Tools
| Service | Role in Cost Optimization |
|---|---|
| AWS Savings Plans | Commit to a consistent hourly spend ($/hr) for 1 or 3 years in exchange for discounts of up to 66%. Compute Savings Plans apply to EC2, Lambda, and Fargate regardless of instance family, size, or region. More flexible than Reserved Instances. |
| Spot Instances | Spare EC2 capacity at up to 90% discount. Interruptible with 2-minute warning. Best for fault-tolerant, stateless, batch workloads. Use Spot Fleet or Auto Scaling Groups with mixed instance policies to maintain capacity across multiple Spot pools. |
| AWS Cost Explorer | Visualize cost and usage over time. Hourly granularity. Filter by service, region, tag, linked account. Cost Explorer also provides Savings Plans and Reserved Instance recommendations based on actual usage history. |
| AWS Trusted Advisor | Checks across cost optimization, performance, security, fault tolerance, and service limits. Cost optimization checks include idle load balancers, underutilized EC2 instances, unassociated Elastic IPs, and unused RDS instances. |
| AWS Compute Optimizer | ML-based recommendations for right-sizing EC2 instances, ECS tasks, Lambda function memory, EBS volumes, and Auto Scaling Groups. Analyzes 14 days of CloudWatch metrics and recommends optimal configurations with projected cost and performance impact. |
Example Design Decisions
- Purchase Compute Savings Plans covering 70% of steady-state baseline compute; run remaining capacity as On-Demand and Spot
- Use Spot Instances for all batch data processing jobs with Auto Scaling Group mixed instance policy (70% Spot, 30% On-Demand) across 6 instance types to minimize interruption risk
- Tag all resources with `Environment`, `Team`, `CostCenter`, and `Application`, enforced via SCP; monthly cost allocation report reviewed by engineering leads
- Implement EC2 Instance Scheduler to shut down development environments outside business hours (saves ~65% of dev EC2 cost)
- Move infrequently accessed S3 data to S3 Intelligent-Tiering, eliminating manual lifecycle rule management while reducing storage cost by 40–60%
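The 70% Savings Plan coverage decision above can be sanity-checked with a back-of-envelope blended-cost model. The On-Demand rate and discount percentages below are illustrative assumptions, not published AWS prices.

```python
# Back-of-envelope model for a mixed purchasing strategy.
# All rates and discounts are illustrative assumptions.
ON_DEMAND_RATE = 1.00    # $/hr for one baseline unit of compute (assumed)
SP_DISCOUNT = 0.40       # Compute Savings Plan discount (assumed)
SPOT_DISCOUNT = 0.70     # Spot discount (assumed)

def blended_hourly_cost(sp_share: float, spot_share: float) -> float:
    """Cost per hour for one baseline unit under the given purchase mix."""
    od_share = 1 - sp_share - spot_share
    return ON_DEMAND_RATE * (
        sp_share * (1 - SP_DISCOUNT)
        + spot_share * (1 - SPOT_DISCOUNT)
        + od_share
    )

# 70% Savings Plans, 20% Spot, 10% On-Demand:
cost = blended_hourly_cost(0.70, 0.20)
# ~0.58: roughly a 42% reduction versus running everything On-Demand.
```

Plugging in your own rates from Cost Explorer turns this into a quick check on whether additional commitment or Spot adoption moves the needle.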
Pillar 6 — Sustainability
Sustainability is the newest pillar, addressing the environmental impact of running cloud workloads. The goal is to minimize the energy consumption and carbon footprint of cloud infrastructure without sacrificing the other five pillars.
Design Principles
- Understand your impact. Use the AWS Customer Carbon Footprint Tool to see your carbon emissions and track them over time as you optimize.
- Establish sustainability goals. Set improvement targets. Measure progress. Sustainability should be a first-class engineering objective, not an afterthought.
- Maximize utilization. Underutilized infrastructure consumes power without delivering proportional value. Right-size everything. Higher utilization per physical server means fewer servers run to do the same work.
- Anticipate and adopt more efficient hardware and software. Migrate to Graviton instances, upgrade to newer generation instance types, and adopt managed services as they become more energy-efficient.
- Use managed services. Managed services are operated at higher utilization than self-managed infrastructure. AWS can run DynamoDB at 80%+ utilization across the global fleet; a self-managed DynamoDB-equivalent would likely run at 20–30% utilization.
- Reduce the downstream impact of your cloud workloads. Efficient code, smaller payloads, appropriate caching, and CDN usage reduce the compute, storage, and network consumed per user interaction.
Key Services and Tools
| Service | Role in Sustainability |
|---|---|
| AWS Graviton | ARM-based processors deliver the same or better performance at up to 60% lower energy use than equivalent x86 instances. Graviton3 is among the most energy-efficient processors AWS offers. |
| AWS Compute Optimizer | Right-sizing recommendations reduce idle compute capacity. Fewer running instances means lower energy consumption. |
| Amazon S3 Intelligent-Tiering | Moves objects to lower-cost, lower-energy storage tiers automatically. Cold data stored at lower energy per byte than hot storage. |
| Managed services broadly | RDS, Aurora, DynamoDB, Lambda, Fargate — all run at higher infrastructure utilization rates than equivalent self-managed deployments, translating to less energy per unit of work. |
| AWS Customer Carbon Footprint Tool | Monthly carbon emissions data by service, region, and account. Available in the AWS Billing console. |
Example Design Decisions
- Migrate all general-purpose EC2 workloads from M6i to M7g (Graviton3) — 40% cost reduction, 60% energy reduction per instance
- Replace self-managed Redis cluster (running at 15% utilization) with ElastiCache for Redis — higher fleet utilization, no idle capacity
- Implement aggressive CloudFront caching to reduce origin requests by 80% — same user experience, 80% less compute executed per request
- Adopt Lambda for event-driven workloads that were previously running on always-on EC2 — zero energy consumed when no requests arrive
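The caching decision above rests on simple arithmetic: origin compute scales with cache misses, so an 80% edge hit rate removes 80% of origin requests. The request volume below is an illustrative assumption.

```python
# Quantifying cache-driven compute reduction. Request volume is an
# illustrative assumption.
def origin_requests(total_requests: int, cache_hit_rate: float) -> int:
    """Requests that still reach (and consume compute at) the origin."""
    return round(total_requests * (1 - cache_hit_rate))

before = origin_requests(10_000_000, 0.0)   # no CDN caching
after = origin_requests(10_000_000, 0.8)    # 80% edge hit rate
# before == 10_000_000, after == 2_000_000: the origin fleet (and its
# energy draw) can shrink accordingly.
```

The same calculation applies to any layer that short-circuits work: ElastiCache in front of a database, or browser caching in front of CloudFront.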
Pillar Summary
| Pillar | Core Question | Key AWS Services | Example Design Decision |
|---|---|---|---|
| Operational Excellence | Can we run and improve this system sustainably? | CloudFormation, CloudWatch, X-Ray, SSM, Config | All infrastructure defined as CDK code; runbooks as SSM documents |
| Security | Is the data and system protected at every layer? | IAM, KMS, GuardDuty, Security Hub, WAF, Inspector | GuardDuty + Security Hub enabled in all regions; WAF on all public ALBs |
| Reliability | Will it recover from failures without human intervention? | Auto Scaling, Multi-AZ, Route 53, AWS Backup, SQS | Three-AZ deployment with automatic EC2 and RDS failover |
| Performance Efficiency | Are we using the right tools at the right size? | CloudFront, ElastiCache, Aurora, Lambda, Graviton | Migrate batch jobs to Graviton3; cache read-heavy data in ElastiCache |
| Cost Optimization | Are we spending efficiently and attributing cost correctly? | Savings Plans, Spot, Cost Explorer, Trusted Advisor, Compute Optimizer | 70% Savings Plan coverage; Spot for batch; enforced resource tagging |
| Sustainability | Are we minimizing environmental impact? | Graviton, Compute Optimizer, managed services, Customer Carbon Footprint Tool | All compute migrated to Graviton3; self-managed clusters replaced with managed services |
The AWS Well-Architected Tool
The AWS Well-Architected Tool is a free self-service questionnaire in the AWS Management Console that walks through the Framework systematically.
How a review works:
- Define a workload: name it, describe it, select the applicable lenses (the standard Framework, plus specialty lenses for serverless, SaaS, IoT, analytics, ML, financial services, and government).
- Answer the questions for each pillar. Questions take the form “Do you [specific practice]?” and ask which of the listed best practices you follow; questions that do not apply to the workload can be marked as such.
- The tool generates a gap report: best practices not implemented, categorized as High Risk Issues or Medium Risk Issues.
- Improvement plan: recommendations for each gap, linked to documentation and relevant AWS services.
- Track progress over time by saving milestone snapshots of each review.
The Well-Architected Tool is most valuable when used as a recurring practice — not a one-time review. Run it quarterly on critical workloads, and whenever a workload undergoes significant architectural change.
AWS partners can also conduct Well-Architected Reviews as a formal engagement, providing external perspective and access to AWS field team support for addressing findings.