AWS Governance & Observability

Overview

Governance in AWS is the set of controls that ensure environments are secure, compliant, cost-managed, and well-structured at scale. Without deliberate governance, accounts proliferate uncontrolled, permissions sprawl, audit trails disappear, and costs spike without warning.

Observability is the ability to understand the internal state of a system from its external outputs — metrics, logs, and traces. AWS governance and observability operate across four layers:

Organizational structure and guardrails — who can create accounts, what those accounts can do, and how accounts are grouped (AWS Organizations, SCPs, Control Tower)
API audit trail — a tamper-evident log of every AWS API call across all accounts (CloudTrail)
Configuration compliance — a continuous record of every resource’s configuration state and whether that state matches policy (AWS Config)
Operational observability — metrics, logs, alarms, dashboards, and anomaly detection for running workloads (CloudWatch, EventBridge)

None of these layers is optional in a production environment. Together they form the control plane that sits above individual services.

AWS Organizations

AWS Organizations groups multiple AWS accounts under a single management account. Large environments typically run dozens or hundreds of AWS accounts — one per application, environment, team, or business unit — rather than one monolithic account. The multi-account model reduces blast radius, enforces least-privilege at the account boundary, simplifies billing, and allows independent audit trails.

Account Hierarchy

Root
└── Management Account (Payer)
    ├── Security OU
    │   ├── Log Archive Account        ← centralized CloudTrail + Config logs
    │   └── Security Tooling Account   ← GuardDuty aggregation, Security Hub
    ├── Infrastructure OU
    │   ├── Network Account            ← shared VPCs, Transit Gateway
    │   └── Shared Services Account    ← Active Directory, monitoring
    ├── Workloads OU
    │   ├── Production OU
    │   │   ├── App-A Production Account
    │   │   └── App-B Production Account
    │   └── Non-Production OU
    │       ├── Staging Accounts
    │       └── Dev Accounts
    └── Sandbox OU
        └── Developer sandbox accounts

The management account created the organization and is the billing payer. It should run minimal workloads — it is the most sensitive account because it can act on all member accounts and is exempt from SCP restrictions.

Consolidated Billing

All member account charges roll up to the management account for a single monthly bill. Key benefits:

Volume pricing aggregation: AWS pricing tiers (S3 drops after the first 50 TB/month, data transfer discounts, etc.) aggregate across all accounts. Twenty accounts reach higher discount tiers faster than any single account would.
Reserved Instance and Savings Plans sharing: RIs and Savings Plans purchased in any account in the organization automatically apply to matching usage in other accounts (unless the management account disables sharing). This maximizes utilization of committed purchases.
Single payment method: One consolidated invoice with per-account line items.
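
The aggregation effect is easiest to see with arithmetic. The sketch below prices twenty accounts separately and then as one consolidated total. The tier sizes and per-TB rates are illustrative assumptions, not current AWS pricing, and `monthly_cost` is a hypothetical helper.

```python
# Hypothetical tiered price schedule: (tier size in TB, USD per TB-month).
# These rates are illustrative assumptions, not current AWS pricing.
TIERS = [(50, 23.0), (450, 22.0), (float("inf"), 21.0)]


def monthly_cost(usage_tb: float) -> float:
    """Price usage against the tiers, filling each tier in order."""
    remaining = usage_tb
    cost = 0.0
    for tier_size, price_per_tb in TIERS:
        in_tier = min(remaining, tier_size)
        cost += in_tier * price_per_tb
        remaining -= in_tier
        if remaining <= 0:
            break
    return cost


# Twenty accounts, each storing 10 TB.
accounts = [10.0] * 20

# Billed one by one, every account stays inside the most expensive first tier.
billed_separately = sum(monthly_cost(tb) for tb in accounts)

# Billed together, 150 of the 200 TB fall into the cheaper second tier.
billed_consolidated = monthly_cost(sum(accounts))

print(f"separate:     ${billed_separately:,.2f}")
print(f"consolidated: ${billed_consolidated:,.2f}")
```

Under these assumed rates the consolidated bill is lower only because the combined usage crosses the first tier boundary; no single account would cross it alone.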

Service Control Policies (SCPs)

SCPs are the primary governance mechanism for controlling what actions are permissible in member accounts. They are JSON policy documents (same syntax as IAM policies) attached to the root, an OU, or an individual account.

Fundamental principles:

SCPs define the maximum permissions available in an account — they are a permission ceiling, not a grant
An Allow in an SCP does not grant any permission; it only permits IAM policies in that account to grant the permission
A Deny in an SCP blocks that action for every identity in the account, including the account root user
Effective permissions = every SCP level (root, each OU, the account) allows the action and no SCP denies it AND IAM policy allows AND resource policy allows (for cross-account)
SCPs do not apply to the management account itself
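
The principles above can be sketched as a small evaluation function. This is a simplified model, not the real IAM evaluation engine: it ignores conditions, resources, permission boundaries, and resource policies, and it matches action names case-sensitively. `Policy` and `is_permitted` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatchcase


@dataclass
class Policy:
    """Simplified policy: action patterns that are allowed or explicitly denied."""
    allow: list = field(default_factory=list)
    deny: list = field(default_factory=list)

    def allows(self, action: str) -> bool:
        return any(fnmatchcase(action, pattern) for pattern in self.allow)

    def denies(self, action: str) -> bool:
        return any(fnmatchcase(action, pattern) for pattern in self.deny)


def is_permitted(action, scp_levels, iam_policy):
    """scp_levels lists the SCPs attached at each level: root, each OU, the account.

    An SCP never grants anything. Every level must allow the action,
    no level may deny it, and the IAM policy must still grant it.
    """
    for level in scp_levels:
        if any(scp.denies(action) for scp in level):
            return False  # explicit Deny always wins
        if not any(scp.allows(action) for scp in level):
            return False  # the ceiling at this level excludes the action
    if iam_policy.denies(action):
        return False
    return iam_policy.allows(action)


full_access = Policy(allow=["*"])  # stands in for the default FullAWSAccess SCP
prod_guardrail = Policy(deny=["ec2:TerminateInstances"])
scp_levels = [[full_access], [full_access, prod_guardrail], [full_access]]

admin = Policy(allow=["*"])
ec2_only = Policy(allow=["ec2:*"])

print(is_permitted("ec2:TerminateInstances", scp_levels, admin))  # False: SCP Deny
print(is_permitted("ec2:DescribeInstances", scp_levels, admin))   # True
print(is_permitted("s3:GetObject", scp_levels, ec2_only))         # False: no IAM grant
```

The third call shows the ceiling behaviour: the SCPs allow `s3:GetObject`, but nothing is granted until an IAM policy allows it as well.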

Common SCP guardrails:

Guardrail	Effect
Deny actions outside approved regions	Prevent resource creation in unapproved regions
Deny `cloudtrail:StopLogging`, `DeleteTrail`	Ensure audit logging cannot be disabled
Deny `iam:CreateUser`	Force use of IAM Identity Center for all human access
Deny `organizations:LeaveOrganization`	Prevent accounts from detaching from central control
Deny `config:StopConfigurationRecorder`	Ensure compliance data collection continues
Deny RI and Savings Plans purchases from member accounts	Centralize purchasing decisions in management account
Deny creation of internet gateways outside the network account	Enforce centralized egress

SCPs are inherited down the OU hierarchy — an SCP on the Production OU applies to every account in that OU and any nested OUs beneath it.
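
As an illustration of the first guardrail in the table, the sketch below builds a region-restriction SCP as a JSON document. It uses the `aws:RequestedRegion` condition key together with `NotAction` to exempt global services. The function name, the region list, and the exempted actions are assumptions chosen for the example; a real policy needs a reviewed exemption list.

```python
import json


def region_deny_scp(approved_regions, exempt_actions):
    """Build an SCP that denies every action outside the approved regions.

    exempt_actions lists global services that must keep working even though
    their requests are not made against an approved region.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyOutsideApprovedRegions",
                "Effect": "Deny",
                "NotAction": sorted(exempt_actions),
                "Resource": "*",
                "Condition": {
                    "StringNotEquals": {
                        "aws:RequestedRegion": sorted(approved_regions)
                    }
                },
            }
        ],
    }


# Example values are assumptions, not a recommended baseline.
policy = region_deny_scp(
    approved_regions=["eu-west-1", "eu-central-1"],
    exempt_actions=["iam:*", "organizations:*", "route53:*", "cloudfront:*", "support:*"],
)

document = json.dumps(policy, indent=2)
print(document)
```

Attached to an OU, a document like this is inherited by every account and nested OU beneath it, as described above.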

AWS Resource Access Manager (RAM)

RAM enables sharing of AWS resources across accounts within an Organization without duplicating infrastructure. Instead of creating a Transit Gateway in every account, a network account creates one and shares it.

Shareable Resource	Common Use Case
VPC subnets	Central networking team creates subnets; business unit accounts deploy workloads into them
Transit Gateway	Share one TGW so other accounts can attach their VPCs to it, centralizing routing
Route 53 Resolver rules	Share DNS forwarding rules so all accounts resolve on-premises hostnames
License Manager configurations	Track licensed software usage centrally
AWS Glue Data Catalog databases	Share data catalog across analytics accounts

AWS Control Tower

Control Tower orchestrates Organizations, SCPs, CloudTrail, Config, and IAM Identity Center into an opinionated, pre-built multi-account baseline called a landing zone. Instead of manually wiring these services together — a process that takes weeks and requires deep expertise — Control Tower provisions the full governance stack in hours.

Landing Zone Structure

Control Tower creates two shared accounts in addition to the management account, giving the baseline landing zone three accounts:

Account	Purpose
Management account	Payer, Control Tower administrator, SCP authority
Log Archive account	Centralized destination for CloudTrail and Config logs from all accounts
Audit account	Read-only access to all accounts for security teams; aggregates Security Hub, GuardDuty findings

Guardrails (Controls)

Control Tower uses the term “guardrails” for its governance rules. Two mechanism types:

Preventive guardrails: Implemented as SCPs. Block disallowed actions at the account level. Examples: disallow public S3 ACLs, disallow root user access keys, disallow changes to CloudTrail configuration.
Detective guardrails: Implemented as AWS Config rules. Detect non-compliant configurations after the fact without blocking them. Examples: detect whether MFA is enabled for root, detect whether EBS volumes are encrypted.

Guardrail categories:

Mandatory: Always enforced. Cannot be disabled. Cover baseline security hygiene.
Strongly recommended: Disabled by default, recommended by AWS. Enable as needed for your compliance posture.
Elective: Optional. Enable for specific regulatory or organizational requirements.

Account Factory

Account Factory is a Service Catalog product that vends new AWS accounts from a template. When a team requests a new account, Account Factory provisions it with:

CloudTrail enabled and reporting to the Log Archive account
Config enabled and reporting to the Log Archive account
IAM Identity Center permission sets assigned
Guardrails applied via SCPs from the OU
Baseline VPC and network configuration (if configured)

Account Factory for Terraform (AFT) extends this with a GitOps model — account requests are submitted as pull requests, and Terraform applies the configuration. Enables version-controlled, auditable account lifecycle management.

AWS CloudTrail

CloudTrail records every API call made to AWS services — from the console, CLI, SDK, or another AWS service acting on your behalf. It is the foundational audit log for the entire AWS control plane.

Event Types

Event Type	What It Captures	Default	Cost
Management events	Control plane operations: `CreateBucket`, `RunInstances`, `DeleteSecurityGroup`, `CreateUser`, `AssumeRole`, `AttachRolePolicy`	Enabled	First copy in each region free; additional copies (for example, from a second trail) are paid
Data events	Data plane operations: S3 `GetObject`/`PutObject`, Lambda `Invoke`, DynamoDB `PutItem`/`GetItem`	Disabled	Paid — generates high volume
Insights events	Unusual API activity: error rate spike, abnormal call volume for a given API	Disabled	Paid

Management events are the most critical for governance. Data events matter for PCI or HIPAA environments where every data access must be audited. Insights events are useful for detecting operational anomalies like a runaway automation script.

Event History vs Trails

Event History: 90 days of management events accessible in the CloudTrail console at no cost. No delivery to S3, and only basic attribute filtering with no customization of what is recorded. Sufficient for ad-hoc investigation, insufficient for compliance retention.

Trails: A trail delivers a continuous stream of events to an S3 bucket and optionally to CloudWatch Logs and EventBridge.

Trail configuration options:

Setting	Description
All regions	Captures events from every region. Prevents blind spots. Strongly recommended.
Organization trail	Created in the management account — captures events from all member accounts. Delivers to a centralized S3 bucket.
Log file integrity validation	SHA-256 digest files, each referencing the previous — forms a hash chain. Validate with `aws cloudtrail validate-logs`. Proves logs have not been tampered with.
KMS encryption	Encrypt log files with a CMK. Restricts who can read the logs.
CloudWatch Logs delivery	Stream events to CloudWatch Logs for near-real-time alerting via metric filters and alarms.
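
The hash chain behind log file integrity validation can be illustrated with a simplified model. This is not the CloudTrail digest file format: real digest files are also signed, and validation should be done with `aws cloudtrail validate-logs`. The record layout and the helper names below are assumptions made for the sketch.

```python
import hashlib
import json


def make_digest(log_files, previous_digest_hash):
    """Build a digest record: one SHA-256 per log file plus the previous digest's hash."""
    return {
        "logFiles": {name: hashlib.sha256(body).hexdigest() for name, body in log_files.items()},
        "previousDigestHash": previous_digest_hash,
    }


def digest_hash(record):
    """Hash a digest record in a stable serialized form."""
    return hashlib.sha256(json.dumps(record, sort_keys=True).encode()).hexdigest()


def verify_chain(digests, delivered_logs):
    """Check every log file against its digest and every digest against its predecessor."""
    previous = None
    for record, logs in zip(digests, delivered_logs):
        if record["previousDigestHash"] != previous:
            return False  # an earlier digest was replaced or removed
        if set(logs) != set(record["logFiles"]):
            return False  # a log file was added or deleted
        for name, body in logs.items():
            if record["logFiles"][name] != hashlib.sha256(body).hexdigest():
                return False  # a log file was modified
        previous = digest_hash(record)
    return True


hour1 = {"a.json.gz": b"event-1"}
hour2 = {"b.json.gz": b"event-2"}
d1 = make_digest(hour1, None)
d2 = make_digest(hour2, digest_hash(d1))

tampered = {"a.json.gz": b"event-1-edited"}
forged_d1 = make_digest(tampered, None)  # attacker rewrites the digest to match

print(verify_chain([d1, d2], [hour1, hour2]))            # True
print(verify_chain([d1, d2], [tampered, hour2]))         # False
print(verify_chain([forged_d1, d2], [tampered, hour2]))  # False: chain link broken
```

The last case is the point of chaining: rewriting one digest to hide an edit breaks the link stored in the next digest. In this unsigned sketch the newest digest has no successor to protect it, which is one reason real digest files carry a signature.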

CloudTrail Lake

CloudTrail Lake is a managed data lake for CloudTrail events — an alternative to the traditional pattern of S3 + Glue + Athena:

Events delivered directly to CloudTrail Lake (not S3)
Query with SQL directly in the console or via API — no Athena setup required
Aggregate events across all accounts and regions in a single query
Retention configurable up to 7 years (up to 10 years with the one-year extendable retention pricing option)
Cross-account analysis without cross-account S3 access configuration

CloudTrail Lake eliminates significant operational overhead for audit-at-scale use cases.

Limitation: CloudTrail is Not Real-Time

CloudTrail events are typically delivered within 15 minutes of the API call. CloudTrail is an audit log, not a real-time alerting system. For near-real-time detection, route CloudTrail to CloudWatch Logs and create metric filters that fire alarms on patterns of concern — root account logins, unauthorized API calls, SCP denials, or changes to security group rules.

AWS Config

CloudTrail records what happened (API calls). AWS Config records what resources look like — their configuration at every point in time. Config answers: “what was the configuration of this security group at 14:35 last Tuesday, and who changed it?”

How Config Works

Config operates through four core concepts:

Configuration recorder: Monitors supported AWS resource types. When a resource changes, Config captures a new configuration item.
Configuration item (CI): A snapshot of a resource’s full configuration — attributes, relationships, tags, IAM permissions — at a specific moment. Includes a reference to the CloudTrail event that triggered the change.
Configuration history: The complete timeline of CIs for a resource. Enables point-in-time reconstruction of any resource’s configuration state.
Configuration delivery channel: Delivers configuration snapshots and change notifications to an S3 bucket and optionally to an SNS topic.
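
Point-in-time reconstruction from a configuration history amounts to finding the latest configuration item captured at or before the moment of interest. The sketch below shows that lookup on an in-memory list. The history entries, their fields, and the `configuration_at` helper are hypothetical; they are not the AWS Config data format or API.

```python
from bisect import bisect_right
from datetime import datetime, timezone

UTC = timezone.utc

# Hypothetical configuration history for one security group, oldest first.
# Each entry is (capture time, simplified configuration).
history = [
    (datetime(2024, 5, 6, 9, 0, tzinfo=UTC),
     {"ingress": [{"port": 443, "cidr": "10.0.0.0/8"}]}),
    (datetime(2024, 5, 7, 14, 20, tzinfo=UTC),
     {"ingress": [{"port": 443, "cidr": "10.0.0.0/8"},
                  {"port": 22, "cidr": "0.0.0.0/0"}]}),
    (datetime(2024, 5, 7, 15, 5, tzinfo=UTC),
     {"ingress": [{"port": 443, "cidr": "10.0.0.0/8"}]}),
]


def configuration_at(history, when):
    """Return the (capture time, configuration) in effect at `when`, or None."""
    capture_times = [captured for captured, _ in history]
    index = bisect_right(capture_times, when)  # entries captured at or before `when`
    if index == 0:
        return None  # the resource was not yet recorded
    return history[index - 1]


captured, config = configuration_at(history, datetime(2024, 5, 7, 14, 35, tzinfo=UTC))
open_ports = [rule["port"] for rule in config["ingress"] if rule["cidr"] == "0.0.0.0/0"]
print(captured.isoformat(), open_ports)
```

Here the lookup for 14:35 lands on the item captured at 14:20, which is the one that opened port 22 to the internet.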

Config Rules

Config Rules evaluate resource configurations against desired-state policies and mark resources as COMPLIANT or NON_COMPLIANT.

AWS managed rules (pre-built, no code required):

Rule	What It Checks
`s3-bucket-public-read-prohibited`	S3 buckets must not allow public read
`ec2-instance-no-public-ip`	EC2 instances must not have public IPs
`mfa-enabled-for-iam-console-access`	IAM users with console access must have MFA
`root-account-mfa-enabled`	Root account must have MFA enabled
`encrypted-volumes`	Attached EBS volumes must be encrypted
`rds-instance-public-access-check`	RDS instances must not be publicly accessible
`restricted-ssh`	Security groups must not allow unrestricted inbound SSH (0.0.0.0/0 on port 22)
`cloudtrail-enabled`	CloudTrail must be enabled
`vpc-flow-logs-enabled`	VPC Flow Logs must be enabled

Custom rules: Lambda-backed rules for organization-specific policies. The Lambda receives a resource’s configuration as input and returns COMPLIANT, NON_COMPLIANT, or NOT_APPLICABLE. Example: every EC2 instance must have a CostCenter tag — instances missing the tag are flagged as non-compliant.
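
The CostCenter example could look roughly like the evaluation logic below. This is a sketch of the decision only: it parses the `invokingEvent` payload that Config passes to the Lambda and builds one evaluation record, but it does not call AWS. In a real function the record would be sent back with the Config `put_evaluations` call together with the event's `resultToken`; that step is omitted here, and the function names are hypothetical.

```python
import json

REQUIRED_TAG = "CostCenter"


def evaluate_compliance(configuration_item):
    """Decide compliance for one configuration item."""
    if configuration_item.get("resourceType") != "AWS::EC2::Instance":
        return "NOT_APPLICABLE"
    if configuration_item.get("configurationItemStatus") == "ResourceDeleted":
        return "NOT_APPLICABLE"
    tags = configuration_item.get("tags") or {}
    # An empty tag value is treated as missing (an assumption of this sketch).
    return "COMPLIANT" if tags.get(REQUIRED_TAG) else "NON_COMPLIANT"


def build_evaluation(event):
    """Turn a Config invocation event into one evaluation record."""
    invoking_event = json.loads(event["invokingEvent"])
    item = invoking_event["configurationItem"]
    return {
        "ComplianceResourceType": item["resourceType"],
        "ComplianceResourceId": item["resourceId"],
        "ComplianceType": evaluate_compliance(item),
        "OrderingTimestamp": item["configurationItemCaptureTime"],
    }


# Hand-written sample event with made-up identifiers.
sample_event = {
    "invokingEvent": json.dumps({
        "configurationItem": {
            "resourceType": "AWS::EC2::Instance",
            "resourceId": "i-0abc1234",
            "configurationItemStatus": "OK",
            "configurationItemCaptureTime": "2024-05-07T14:20:00.000Z",
            "tags": {"Name": "web-1"},
        }
    })
}

evaluation = build_evaluation(sample_event)
print(evaluation["ComplianceType"])  # NON_COMPLIANT: no CostCenter tag
```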

Proactive rules: Some Config rules support proactive evaluation — check whether a resource would be compliant before it is created (via CloudFormation hooks or the StartResourceEvaluation API). Prevent non-compliant resources from being created in the first place.

Conformance Packs

Conformance Packs bundle multiple Config rules and optional remediation actions into a single deployable unit. Pre-built packs exist for:

Framework	Config Pack
CIS AWS Foundations Benchmark	`Operational-Best-Practices-for-CIS-AWS-v1.4-Level1`
PCI-DSS	`Operational-Best-Practices-for-PCI-DSS`
HIPAA	`Operational-Best-Practices-for-HIPAA-Security`
NIST 800-53	`Operational-Best-Practices-for-NIST-800-53`

Deploy an Organization Conformance Pack from the management account (or a delegated administrator account) and Config applies every rule in the pack to all member accounts, with results feeding a centralized compliance dashboard.

Automated Remediation

Config rules integrate with AWS Systems Manager Automation to remediate non-compliant resources:

Non-Compliance	Automated Remediation
S3 bucket with public access enabled	Invoke SSM Automation: enable S3 Block Public Access on the bucket
Security group allows 0.0.0.0/0 on port 22	Invoke SSM Automation: revoke the inbound rule
IAM user missing MFA	Disable console access; notify user to configure MFA
EBS volume unencrypted	Alert only (remediation requires snapshot + re-encryption — manual review recommended)

Remediation can be set to automatic (triggers immediately on non-compliance) or manual (human reviews non-compliance and triggers remediation on demand).

Config vs CloudTrail

Dimension	AWS Config	CloudTrail
Primary question	What is/was this resource configured to do?	Who called which API, when, and from where?
Data model	Resource configuration snapshots over time	API call event records
Retention	Indefinite (delivered to S3)	90 days free; indefinite with trail
Compliance use	Evaluate resource state against policy; detect drift	Audit who performed actions; detect unauthorized calls
Best used together	Config shows a resource changed; CloudTrail shows what API call caused it

Amazon CloudWatch

CloudWatch is the native AWS observability service. It collects and stores metrics, aggregates logs, evaluates alarms, renders dashboards, and applies ML to detect anomalies — across every AWS service and every custom application in your environment.

Metrics

A metric is a time-ordered set of data points identified by a namespace, a metric name, and optional dimensions. AWS services publish metrics to CloudWatch automatically at no charge, at a resolution that depends on the service (commonly 1 or 5 minutes; EC2 basic monitoring uses 5 minutes).

Namespace: Groups related metrics by service. AWS/EC2, AWS/RDS, AWS/Lambda, AWS/ApplicationELB. Custom metrics from your applications use your own namespace.

Dimensions: Key-value pairs that identify a specific resource within a namespace. InstanceId=i-0abc1234 narrows CPUUtilization from all EC2 instances to one specific instance.

Metric Source	Resolution	Notes
EC2 basic monitoring	5 minutes	Free; CPU, network, disk I/O
EC2 detailed monitoring	1 minute	Paid; same metrics, finer granularity
CloudWatch Agent (on EC2)	Configurable	RAM, disk space, process counts, custom app metrics — not published by AWS by default
Custom via `PutMetricData` API	Any resolution ≥ 1 second	Application-level metrics from any source

RAM utilization and disk space usage are not published by AWS automatically — the CloudWatch Agent must be installed and configured on each instance to collect OS-level metrics.
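
For application-level metrics, the namespace, name, and dimensions described above map onto the request shape of `PutMetricData`. The sketch below only builds that payload so it can run without AWS credentials. The helper names, the `MyApp/Checkout` namespace, and the metric values are assumptions for illustration.

```python
from datetime import datetime, timezone


def validate_namespace(namespace):
    """Custom namespaces must not use the prefix reserved for AWS services."""
    if namespace.startswith("AWS/"):
        raise ValueError("the AWS/ prefix is reserved for AWS service metrics")
    return namespace


def build_metric_datum(name, value, unit, dimensions, high_resolution=False):
    """Build one MetricData entry for a PutMetricData request."""
    return {
        "MetricName": name,
        "Dimensions": [{"Name": k, "Value": v} for k, v in sorted(dimensions.items())],
        "Timestamp": datetime.now(timezone.utc),
        "Value": float(value),
        "Unit": unit,
        # 1 = high resolution (1-second), 60 = standard resolution
        "StorageResolution": 1 if high_resolution else 60,
    }


namespace = validate_namespace("MyApp/Checkout")  # hypothetical custom namespace
datum = build_metric_datum(
    name="OrderLatency",
    value=182,
    unit="Milliseconds",
    dimensions={"Environment": "prod", "InstanceId": "i-0abc1234"},
    high_resolution=True,
)
print(namespace, datum["MetricName"], datum["Dimensions"])

# With boto3 installed and credentials configured, the payload would be sent as:
#   boto3.client("cloudwatch").put_metric_data(Namespace=namespace, MetricData=[datum])
```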

CloudWatch Alarms

An alarm monitors a single metric (or metric math expression) and transitions between three states:

State	Meaning
`OK`	Metric is within the defined threshold
`ALARM`	Metric has breached the threshold for the required evaluation periods
`INSUFFICIENT_DATA`	Not enough data points to evaluate — often seen at startup

Alarm actions trigger on state transitions:

EC2 actions: Stop, terminate, reboot, or recover the monitored instance
Auto Scaling: Scale out or scale in an Auto Scaling group
SNS: Publish a message to an SNS topic → email, SMS, HTTP endpoint, Lambda, PagerDuty, Slack

Composite Alarms: Combine multiple alarms with Boolean logic (AND, OR, NOT). Instead of alerting on CPU > 80% alone (which might be a false alarm on a healthy batch job), combine CPU > 80% AND memory > 85% AND disk > 90% → alert only when all three conditions are true simultaneously. Reduces alert noise.
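
A composite alarm is defined by a rule expression over the states of its child alarms. The sketch below builds such an expression in the `ALARM("name")` form and evaluates it locally against sample states. The alarm names are made up, and the local evaluation is a simplification: it treats `INSUFFICIENT_DATA` the same as `OK`.

```python
def composite_rule(alarm_names, operator="AND"):
    """Build a composite alarm rule expression from child alarm names."""
    if operator not in ("AND", "OR"):
        raise ValueError("operator must be AND or OR")
    return f" {operator} ".join(f'ALARM("{name}")' for name in alarm_names)


def composite_state(child_states, operator="AND"):
    """Evaluate the rule locally: fire only when the Boolean condition holds."""
    in_alarm = [state == "ALARM" for state in child_states.values()]
    fired = all(in_alarm) if operator == "AND" else any(in_alarm)
    return "ALARM" if fired else "OK"


children = ["cpu-high", "memory-high", "disk-high"]  # hypothetical alarm names
rule = composite_rule(children)
print(rule)

busy_batch_job = {"cpu-high": "ALARM", "memory-high": "OK", "disk-high": "OK"}
resource_exhaustion = {"cpu-high": "ALARM", "memory-high": "ALARM", "disk-high": "ALARM"}

print(composite_state(busy_batch_job))       # OK: CPU alone does not page anyone
print(composite_state(resource_exhaustion))  # ALARM: all three conditions hold
```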

CloudWatch Logs

CloudWatch Logs is the central log aggregation service for AWS. Services, applications, and the CloudWatch Agent stream log records to CloudWatch Logs.

Log structure:

Log group: Container for related logs from one application or service. Configure retention policy (1 day to 10 years, or never expire) and KMS encryption at the log group level.
Log stream: A sequence of events from a single source within the group. An ECS task might write to a stream per container, per task run.

Metric Filters: Scan incoming log records for a pattern and increment a custom CloudWatch metric when matched. Example: filter for [ERROR] in an application log → create the metric ApplicationErrorRate → alarm when it exceeds 50 per minute. This bridges unstructured logs into the metrics alerting system.
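
The metric filter idea can be sketched locally: count matching log lines per minute and flag the minutes that would breach the alarm threshold. The substring match below is a stand-in for the real CloudWatch filter pattern syntax, and the function names and sample log lines are assumptions.

```python
from collections import Counter


def errors_per_minute(log_events, pattern="[ERROR]"):
    """Count matching messages per minute.

    log_events is an iterable of (epoch_seconds, message) pairs.
    A plain substring test stands in for a CloudWatch filter pattern.
    """
    counts = Counter()
    for timestamp, message in log_events:
        if pattern in message:
            minute_start = timestamp // 60 * 60
            counts[minute_start] += 1
    return counts


def breaching_minutes(counts, threshold=50):
    """Return the minutes whose error count exceeds the threshold."""
    return sorted(minute for minute, count in counts.items() if count > threshold)


# Synthetic log: 51 errors in the first minute, 50 in the second, plus noise.
events = [(second % 60, "[ERROR] payment failed") for second in range(51)]
events += [(60 + second, "[ERROR] payment failed") for second in range(50)]
events += [(30, "[INFO] health check ok"), (90, "[INFO] health check ok")]

counts = errors_per_minute(events)
print(dict(counts))
print(breaching_minutes(counts))  # only the first minute exceeds 50
```

The second minute has exactly 50 errors and does not breach, because the condition in the example above is "exceeds 50 per minute".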

CloudWatch Logs Insights: Interactive SQL-like query language for ad-hoc log analysis:

fields @timestamp, @message
| filter @message like /ERROR/
| stats count(*) as errorCount by bin(5m)
| sort errorCount desc
| limit 200

Query across multiple log groups simultaneously. Visualize time series, distribution histograms, or tabular results. Pin queries to dashboards.

Log subscriptions: Stream log data to Kinesis Data Streams, Kinesis Data Firehose, or Lambda in real time. Use cases: forward logs to a third-party SIEM, transform and load logs to OpenSearch, archive logs to S3 with Firehose.

Additional CloudWatch Features

Anomaly Detection: CloudWatch trains an ML model on a metric’s historical behavior and creates a band of statistically expected values. Alarms fire when the metric falls outside the band without requiring manual threshold configuration. Automatically accounts for daily and weekly seasonality.

Composite Alarms with Anomaly Detection: Combine anomaly-based alarms with threshold alarms for more intelligent alerting.

CloudWatch Dashboards: Cross-account, cross-region dashboards that visualize any combination of metrics, log insights results, and alarm status in one view. Shared dashboards for NOC displays.

Container Insights: Collect metrics and logs from ECS clusters and EKS clusters at the cluster, service, task, and container level. Standard EC2 metrics only report at the instance level — Container Insights provides task-level and container-level granularity.

Synthetics: Scheduled canary scripts (Node.js or Python) that simulate user interactions with your application — loading a page, submitting a form, calling an API. Measures availability and latency from outside your application’s own instrumentation. Alerts when canaries fail or exceed latency thresholds.

Evidently: Feature flag management and A/B testing integrated with CloudWatch metrics. Gradually roll out features to a percentage of users, measure the metric impact, and roll back automatically if metrics degrade.

Amazon EventBridge for Governance Automation

EventBridge is the AWS event bus that connects AWS services, applications, and SaaS products in near real-time. For governance, EventBridge enables automated response to control plane events:

Source Event	EventBridge Rule	Target Action
Config rule non-compliance detected	Rule: `source = aws.config, detail.newEvaluationResult.complianceType = NON_COMPLIANT`	Lambda: remediate the resource
GuardDuty finding (high severity)	Rule: `source = aws.guardduty, detail.severity >= 7`	Step Functions: isolate instance, snapshot, notify security team
CloudTrail root account login	Rule: `source = aws.signin, detail.userIdentity.type = Root`	SNS: immediate alert to security on-call
EC2 instance launched without required tags	Rule: `source = aws.ec2, detail.eventName = RunInstances`	Lambda: tag check, stop instance if tag policy violated

EventBridge rules evaluate events in milliseconds — the closest thing to real-time governance response that AWS provides without custom infrastructure.
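
The rule conditions in the table correspond to EventBridge event patterns, which are JSON documents in which each field lists its acceptable values. The sketch below writes three of the patterns in that form and includes a small local matcher. The matcher is a simplified illustration, not the EventBridge matching engine: it supports only exact values and a single numeric comparison. The sample events are hand-written.

```python
import operator

PATTERNS = {
    "config-noncompliant": {
        "source": ["aws.config"],
        "detail-type": ["Config Rules Compliance Change"],
        "detail": {"newEvaluationResult": {"complianceType": ["NON_COMPLIANT"]}},
    },
    "guardduty-high": {
        "source": ["aws.guardduty"],
        "detail-type": ["GuardDuty Finding"],
        "detail": {"severity": [{"numeric": [">=", 7]}]},
    },
    "root-login": {
        "source": ["aws.signin"],
        "detail-type": ["AWS Console Sign In via CloudTrail"],
        "detail": {"userIdentity": {"type": ["Root"]}},
    },
}

COMPARATORS = {">=": operator.ge, ">": operator.gt, "<=": operator.le,
               "<": operator.lt, "=": operator.eq}


def value_matches(candidate, actual):
    """Match one allowed value: an exact value or a single numeric comparison."""
    if isinstance(candidate, dict) and "numeric" in candidate:
        symbol, bound = candidate["numeric"]
        return isinstance(actual, (int, float)) and COMPARATORS[symbol](actual, bound)
    return candidate == actual


def matches(pattern, event):
    """Every field named in the pattern must be present and match one listed value."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif not any(value_matches(candidate, actual) for candidate in expected):
            return False
    return True


finding = {"source": "aws.guardduty", "detail-type": "GuardDuty Finding",
           "detail": {"severity": 8, "type": "example-finding"}}
low_finding = {**finding, "detail": {"severity": 4}}

print(matches(PATTERNS["guardduty-high"], finding))      # True
print(matches(PATTERNS["guardduty-high"], low_finding))  # False
print(matches(PATTERNS["root-login"], finding))          # False: different source
```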

Governance Flow Example

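The flow below traces a single denied request from the API call to the alert:

Request: A developer calls ec2:TerminateInstances against an account in the Production OU.
SCP evaluation: An SCP on the Production OU denies ec2:TerminateInstances. The explicit Deny overrides any IAM Allow, so the request is blocked with an access-denied error.
CloudTrail: The denied API call is recorded as an event (errorCode=AccessDenied, eventName=TerminateInstances). EC2 itself reports authorization failures as Client.UnauthorizedOperation, so a filter meant to catch denied EC2 calls should match that code as well.
CloudWatch Logs: The event is streamed to the log group /aws/cloudtrail/organization-trail.
Metric filter: The filter pattern AccessDenied matches the event and increments the metric FailedAPICallCount by 1.
Alarm: When FailedAPICallCount > 5 in 5 minutes, the alarm publishes to SNS, which emails the subscribed recipients.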