AWS Governance & Observability


Managing AWS at scale — Organizations, Control Tower, Service Control Policies, CloudTrail audit logging, AWS Config compliance, and CloudWatch observability.

Tags: aws, organizations, cloudtrail, config, cloudwatch, governance, compliance

Overview

Governance in AWS is the set of controls that ensure environments are secure, compliant, cost-managed, and well-structured at scale. Without deliberate governance, accounts proliferate uncontrolled, permissions sprawl, audit trails disappear, and costs spike without warning.

Observability is the ability to understand the internal state of a system from its external outputs — metrics, logs, and traces. AWS governance and observability operate across four layers:

  1. Organizational structure and guardrails — who can create accounts, what those accounts can do, and how accounts are grouped (AWS Organizations, SCPs, Control Tower)
  2. API audit trail — a tamper-evident log of every AWS API call across all accounts (CloudTrail)
  3. Configuration compliance — a continuous record of every resource’s configuration state and whether that state matches policy (AWS Config)
  4. Operational observability — metrics, logs, alarms, dashboards, and anomaly detection for running workloads (CloudWatch, EventBridge)

None of these layers is optional in a production environment. Together they form the control plane that sits above individual services.


AWS Organizations

AWS Organizations groups multiple AWS accounts under a single management account. Large environments typically run dozens or hundreds of AWS accounts — one per application, environment, team, or business unit — rather than one monolithic account. The multi-account model reduces blast radius, enforces least-privilege at the account boundary, simplifies billing, and allows independent audit trails.

Account Hierarchy

Root
└── Management Account (Payer)
    ├── Security OU
    │   ├── Log Archive Account        ← centralized CloudTrail + Config logs
    │   └── Security Tooling Account   ← GuardDuty aggregation, Security Hub
    ├── Infrastructure OU
    │   ├── Network Account            ← shared VPCs, Transit Gateway
    │   └── Shared Services Account    ← Active Directory, monitoring
    ├── Workloads OU
    │   ├── Production OU
    │   │   ├── App-A Production Account
    │   │   └── App-B Production Account
    │   └── Non-Production OU
    │       ├── Staging Accounts
    │       └── Dev Accounts
    └── Sandbox OU
        └── Developer sandbox accounts

The management account created the organization and is the billing payer. It should run minimal workloads — it is the most sensitive account because it can act on all member accounts and is exempt from SCP restrictions.

Consolidated Billing

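All member account charges roll up to the management account for a single monthly bill. Usage is aggregated across accounts, so volume pricing tiers apply to the organization's combined usage, and Reserved Instance and Savings Plans discounts can be shared across accounts.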

Service Control Policies (SCPs)

SCPs are the primary governance mechanism for controlling what actions are permissible in member accounts. They are JSON policy documents (same syntax as IAM policies) attached to the root, an OU, or an individual account.

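Fundamental principles:

  1. SCPs never grant permissions. They set the maximum permissions available in an account; an IAM policy still has to allow the action.
  2. An explicit Deny in any applicable SCP overrides any IAM Allow.
  3. SCPs apply to every IAM user and role in a member account, including the root user.
  4. SCPs do not restrict the management account.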

Common SCP guardrails:

Guardrail | Effect
Deny actions outside approved regions | Prevent resource creation in unapproved regions
Deny cloudtrail:StopLogging, DeleteTrail | Ensure audit logging cannot be disabled
Deny iam:CreateUser | Force use of IAM Identity Center for all human access
Deny organizations:LeaveOrganization | Prevent accounts from detaching from central control
Deny config:StopConfigurationRecorder | Ensure compliance data collection continues
Deny RI and Savings Plans purchases from member accounts | Centralize purchasing decisions in management account
Deny creation of internet gateways outside the network account | Enforce centralized egress
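
As an illustration of the first two guardrails in the table, the following sketch builds an SCP document as a Python dictionary. The function name, the statement IDs, and the list of exempted global services are assumptions made for this example; a real policy would need its exemption list reviewed against the services the organization actually uses.

```python
import json


def build_guardrail_scp(approved_regions):
    """Return an SCP document (as a dict) with two Deny statements."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyOutsideApprovedRegions",  # hypothetical Sid
                "Effect": "Deny",
                # Global services are exempted so they keep working.
                # This particular list is an assumption, not a recommendation.
                "NotAction": ["iam:*", "organizations:*", "route53:*", "support:*"],
                "Resource": "*",
                "Condition": {
                    "StringNotEquals": {
                        "aws:RequestedRegion": sorted(approved_regions)
                    }
                },
            },
            {
                "Sid": "ProtectCloudTrail",  # hypothetical Sid
                "Effect": "Deny",
                "Action": ["cloudtrail:StopLogging", "cloudtrail:DeleteTrail"],
                "Resource": "*",
            },
        ],
    }


policy = build_guardrail_scp({"eu-west-1", "us-east-1"})
print(json.dumps(policy, indent=2))
```

The printed JSON is what would be attached to the root, an OU, or an account. Both statements are Deny statements, so the policy only removes permissions and relies on IAM policies to grant them.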

SCPs are inherited down the OU hierarchy — an SCP on the Production OU applies to every account in that OU and any nested OUs beneath it.
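
Inheritance can be pictured with a toy model of the hierarchy. The sketch below is a deliberate simplification: it tracks only denied actions (real SCP evaluation also requires an Allow at every level of the hierarchy), and the `Node` class and `inherited_denies` function are hypothetical names, not part of any AWS SDK.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    """A root, OU, or account in the organization tree."""
    name: str
    parent: Optional["Node"] = None
    denied_actions: set = field(default_factory=set)
    is_management_account: bool = False


def inherited_denies(node):
    """Collect every action denied by SCPs from this node up to the root."""
    if node.is_management_account:
        return set()  # the management account is exempt from SCPs
    denies = set()
    current = node
    while current is not None:
        denies |= current.denied_actions
        current = current.parent
    return denies


root = Node("Root", denied_actions={"organizations:LeaveOrganization"})
workloads = Node("Workloads OU", parent=root)
production = Node("Production OU", parent=workloads,
                  denied_actions={"ec2:TerminateInstances"})
app_a = Node("App-A Production Account", parent=production)
management = Node("Management Account", parent=root, is_management_account=True)

print(sorted(inherited_denies(app_a)))
print(sorted(inherited_denies(management)))
```

The production account picks up the deny attached at the root as well as the one on its own OU, while the management account picks up none.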

AWS Resource Access Manager (RAM)

RAM enables sharing of AWS resources across accounts within an Organization without duplicating infrastructure. Instead of creating a Transit Gateway in every account, a network account creates one and shares it.

Shareable Resource | Common Use Case
VPC subnets | Central networking team creates subnets; business unit accounts deploy workloads into them
Transit Gateway | Share TGW attachments across accounts for centralized routing
Route 53 Resolver rules | Share DNS forwarding rules so all accounts resolve on-premises hostnames
License Manager configurations | Track licensed software usage centrally
AWS Glue Data Catalog databases | Share data catalog across analytics accounts

AWS Control Tower

Control Tower orchestrates Organizations, SCPs, CloudTrail, Config, and IAM Identity Center into an opinionated, pre-built multi-account baseline called a landing zone. Instead of manually wiring these services together — a process that takes weeks and requires deep expertise — Control Tower provisions the full governance stack in hours.

Landing Zone Structure

A landing zone has three mandatory accounts: the management account plus two shared accounts that Control Tower creates:

Account | Purpose
Management account | Payer, Control Tower administrator, SCP authority
Log Archive account | Centralized destination for CloudTrail and Config logs from all accounts
Audit account | Read-only access to all accounts for security teams; aggregates Security Hub, GuardDuty findings

Guardrails (Controls)

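Control Tower uses the term “guardrails” for its governance rules. Two mechanism types: preventive guardrails, implemented as SCPs, block disallowed actions before they happen; detective guardrails, implemented as AWS Config rules, flag non-compliant resources after the fact.

Guardrail categories: mandatory (always enforced), strongly recommended, and elective.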

Account Factory

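Account Factory is a Service Catalog product that vends new AWS accounts from a template. When a team requests a new account, Account Factory provisions it in the chosen OU with the landing zone baseline already applied: the OU's guardrails, centralized logging, and IAM Identity Center access.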

Account Factory for Terraform (AFT) extends this with a GitOps model: account requests are submitted as pull requests, and Terraform applies the configuration. The result is version-controlled, auditable account lifecycle management.


AWS CloudTrail

CloudTrail records every API call made to AWS services — from the console, CLI, SDK, or another AWS service acting on your behalf. It is the foundational audit log for the entire AWS control plane.

Event Types

Event Type | What It Captures | Default | Cost
Management events | Control plane operations: CreateBucket, RunInstances, DeleteSecurityGroup, CreateUser, AssumeRole, AttachRolePolicy | Enabled | First copy in each region free; additional copies paid
Data events | Data plane operations: S3 GetObject/PutObject, Lambda Invoke, DynamoDB PutItem/GetItem | Disabled | Paid; generates high volume
Insights events | Unusual API activity: error rate spike, abnormal call volume for a given API | Disabled | Paid

Management events are the most critical for governance. Data events matter for PCI or HIPAA environments where every data access must be audited. Insights events are useful for detecting operational anomalies like a runaway automation script.

Event History vs Trails

Event History: 90 days of management events accessible in the CloudTrail console at no cost. No delivery to S3, and no control over which events are recorded or how long they are kept. Sufficient for ad-hoc investigation, insufficient for compliance retention.

Trails: A trail delivers a continuous stream of events to an S3 bucket and optionally to CloudWatch Logs and EventBridge.

Trail configuration options:

Setting | Description
All regions | Captures events from every region. Prevents blind spots. Strongly recommended.
Organization trail | Created in the management account; captures events from all member accounts. Delivers to a centralized S3 bucket.
Log file integrity validation | SHA-256 digest files, each referencing the previous, form a hash chain. Validate with aws cloudtrail validate-logs. Proves logs have not been tampered with.
KMS encryption | Encrypt log files with a CMK. Restricts who can read the logs.
CloudWatch Logs delivery | Stream events to CloudWatch Logs for near-real-time alerting via metric filters and alarms.
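
The hash-chain idea behind log file integrity validation can be sketched in a few lines. This is a simplified model, not CloudTrail's actual digest file format: real digest files are signed JSON documents, and the function names and sample log contents here are made up for illustration.

```python
import hashlib


def build_digest_chain(log_files):
    """Each digest covers one log file's hash plus the previous digest."""
    chain, previous = [], ""
    for content in log_files:
        file_hash = hashlib.sha256(content).hexdigest()
        digest = hashlib.sha256((previous + file_hash).encode()).hexdigest()
        chain.append(digest)
        previous = digest
    return chain


def first_mismatch(log_files, chain):
    """Return the index of the first log file that no longer matches, or None."""
    recomputed = build_digest_chain(log_files)
    for index, (expected, actual) in enumerate(zip(chain, recomputed)):
        if expected != actual:
            return index
    return None


logs = [b"event-batch-1", b"event-batch-2", b"event-batch-3"]
chain = build_digest_chain(logs)

tampered = [logs[0], b"event-batch-2-edited", logs[2]]
print(first_mismatch(logs, chain))
print(first_mismatch(tampered, chain))
```

Because every digest folds in the one before it, editing one file invalidates its own digest and every later one, which is why an attacker cannot quietly rewrite a single log file in the middle of the chain.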

CloudTrail Lake

CloudTrail Lake is a managed data lake for CloudTrail events, an alternative to the traditional pattern of S3 + Glue + Athena. Events are kept in event data stores and queried directly with SQL.

CloudTrail Lake eliminates significant operational overhead for audit-at-scale use cases.

Limitation: CloudTrail is Not Real-Time

CloudTrail events are typically delivered within 15 minutes of the API call. CloudTrail is an audit log, not a real-time alerting system. For near-real-time detection, route CloudTrail to CloudWatch Logs and create metric filters that fire alarms on patterns of concern — root account logins, unauthorized API calls, SCP denials, or changes to security group rules.
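
The patterns of concern listed above map onto a handful of fields in each CloudTrail record. The sketch below classifies a record by inspecting `eventName`, `errorCode`, and `userIdentity.type`, which are fields CloudTrail records carry; the label names and the `classify` function are assumptions for this example, and in practice the same conditions would be written as metric filter patterns.

```python
SECURITY_GROUP_CHANGES = {
    "AuthorizeSecurityGroupIngress",
    "AuthorizeSecurityGroupEgress",
    "RevokeSecurityGroupIngress",
    "RevokeSecurityGroupEgress",
}

UNAUTHORIZED_ERROR_CODES = {"AccessDenied", "Client.UnauthorizedOperation"}


def classify(record):
    """Return the labels of concern that apply to one CloudTrail record."""
    labels = []
    identity = record.get("userIdentity", {})
    if identity.get("type") == "Root" and record.get("eventName") == "ConsoleLogin":
        labels.append("root-login")
    if record.get("errorCode") in UNAUTHORIZED_ERROR_CODES:
        labels.append("unauthorized-call")
    if record.get("eventName") in SECURITY_GROUP_CHANGES:
        labels.append("security-group-change")
    return labels


sample_records = [
    {"eventName": "ConsoleLogin", "userIdentity": {"type": "Root"}},
    {"eventName": "TerminateInstances", "errorCode": "Client.UnauthorizedOperation"},
    {"eventName": "AuthorizeSecurityGroupIngress", "userIdentity": {"type": "IAMUser"}},
    {"eventName": "DescribeInstances"},
]
for record in sample_records:
    print(record["eventName"], classify(record))
```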


AWS Config

CloudTrail records what happened (API calls). AWS Config records what resources look like — their configuration at every point in time. Config answers: “what was the configuration of this security group at 14:35 last Tuesday, and who changed it?”

How Config Works

Config operates through four core concepts:

  1. Configuration recorder: Monitors supported AWS resource types. When a resource changes, Config captures a new configuration item.
  2. Configuration item (CI): A snapshot of a resource’s configuration at a specific moment: metadata, attributes, relationships to other resources, and current settings, including tags. Includes a reference to the CloudTrail event that triggered the change.
  3. Configuration history: The complete timeline of CIs for a resource. Enables point-in-time reconstruction of any resource’s configuration state.
  4. Configuration delivery channel: Delivers configuration snapshots and change notifications to an S3 bucket and optionally to an SNS topic.
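
Configuration history is essentially a time-ordered list of configuration items per resource, which makes point-in-time reconstruction a lookup. The class below is a minimal sketch of that idea; `ConfigurationHistory` and its methods are hypothetical names, and the security group contents are invented sample data.

```python
from bisect import bisect_right
from datetime import datetime


class ConfigurationHistory:
    """Timeline of configuration items for a single resource."""

    def __init__(self):
        self._times = []
        self._items = []

    def record(self, captured_at, configuration):
        # Assumes items are recorded in chronological order.
        self._times.append(captured_at)
        self._items.append(configuration)

    def at(self, moment):
        """Return the configuration in effect at `moment`, or None if unknown."""
        index = bisect_right(self._times, moment)
        return self._items[index - 1] if index else None


history = ConfigurationHistory()
history.record(datetime(2024, 1, 1, 9, 0), {"ingress": ["10.0.0.0/8:443"]})
history.record(datetime(2024, 1, 2, 14, 30),
               {"ingress": ["10.0.0.0/8:443", "0.0.0.0/0:22"]})

print(history.at(datetime(2024, 1, 2, 14, 35)))
```

A query for 14:35 returns the item captured at 14:30, the most recent snapshot at or before the requested time, which is how a question like "what did this security group look like last Tuesday" gets answered.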

Config Rules

Config Rules evaluate resource configurations against desired-state policies and mark resources as COMPLIANT or NON_COMPLIANT.

AWS managed rules (pre-built, no code required):

Rule | What It Checks
s3-bucket-public-read-prohibited | S3 buckets must not allow public read
ec2-instance-no-public-ip | EC2 instances must not have public IPs
mfa-enabled-for-iam-console-access | IAM users with console access must have MFA
root-account-mfa-enabled | Root account must have MFA enabled
encrypted-volumes | Attached EBS volumes must be encrypted
rds-instance-public-access-check | RDS instances must not be publicly accessible
restricted-ssh | Security groups must not allow unrestricted inbound SSH (0.0.0.0/0 on port 22)
cloudtrail-enabled | CloudTrail must be enabled
vpc-flow-logs-enabled | VPC Flow Logs must be enabled

Custom rules: Lambda-backed rules for organization-specific policies. The Lambda receives a resource’s configuration as input and returns COMPLIANT, NON_COMPLIANT, or NOT_APPLICABLE. Example: every EC2 instance must have a CostCenter tag — instances missing the tag are flagged as non-compliant.
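
The CostCenter example might look roughly like this. The sketch covers only the evaluation logic: a deployed rule would also report the result back to Config with the PutEvaluations API, which is omitted here so the example runs without AWS credentials. The function names and the sample event are assumptions; the `invokingEvent` and `configurationItem` fields follow the event shape Config sends to custom rule functions.

```python
import json

REQUIRED_TAG = "CostCenter"


def evaluate_compliance(configuration_item):
    """Decide the compliance status of one configuration item."""
    if configuration_item.get("resourceType") != "AWS::EC2::Instance":
        return "NOT_APPLICABLE"
    if configuration_item.get("configurationItemStatus") == "ResourceDeleted":
        return "NOT_APPLICABLE"
    tags = configuration_item.get("tags") or {}
    return "COMPLIANT" if tags.get(REQUIRED_TAG) else "NON_COMPLIANT"


def lambda_handler(event, context):
    # Config passes the triggering change as a JSON string in invokingEvent.
    invoking_event = json.loads(event["invokingEvent"])
    return evaluate_compliance(invoking_event["configurationItem"])


sample_event = {
    "invokingEvent": json.dumps({
        "configurationItem": {
            "resourceType": "AWS::EC2::Instance",
            "configurationItemStatus": "OK",
            "tags": {"Name": "web-1"},
        }
    })
}
print(lambda_handler(sample_event, None))
```

Keeping the decision in a pure function makes the rule easy to unit test before it is wired to Config.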

Proactive rules: Some Config rules support proactive evaluation, which checks whether a resource would be compliant before it is created (via CloudFormation hooks or the StartResourceEvaluation API). This keeps non-compliant resources from being created in the first place.

Conformance Packs

Conformance Packs bundle multiple Config rules and optional remediation actions into a single deployable unit. Pre-built packs exist for:

Framework | Conformance Pack
CIS AWS Foundations Benchmark | Operational-Best-Practices-for-CIS-AWS-v1.4-Level1
PCI-DSS | Operational-Best-Practices-for-PCI-DSS
HIPAA | Operational-Best-Practices-for-HIPAA-Security
NIST 800-53 | Operational-Best-Practices-for-NIST-800-53

Deploy an Organization Conformance Pack from the management account (or a delegated administrator account) and its rules are rolled out to every member account. Pair it with a Config aggregator to get a centralized compliance dashboard.

Automated Remediation

Config rules integrate with AWS Systems Manager Automation to remediate non-compliant resources:

Non-Compliance | Automated Remediation
S3 bucket with public access enabled | Invoke SSM Automation: enable S3 Block Public Access on the bucket
Security group allows 0.0.0.0/0 on port 22 | Invoke SSM Automation: revoke the inbound rule
IAM user missing MFA | Disable console access; notify user to configure MFA
EBS volume unencrypted | Alert only (remediation requires snapshot + re-encryption; manual review recommended)

Remediation can be set to automatic (triggers immediately on non-compliance) or manual (human reviews non-compliance and triggers remediation on demand).

Config vs CloudTrail

Dimension | AWS Config | CloudTrail
Primary question | What is/was this resource configured to do? | Who called which API, when, and from where?
Data model | Resource configuration snapshots over time | API call event records
Retention | Indefinite (delivered to S3) | 90 days free; indefinite with trail
Compliance use | Evaluate resource state against policy; detect drift | Audit who performed actions; detect unauthorized calls
Best used together | Config shows that a resource changed | CloudTrail shows which API call caused it

Amazon CloudWatch

CloudWatch is the native AWS observability service. It collects and stores metrics, aggregates logs, evaluates alarms, renders dashboards, and applies ML to detect anomalies — across every AWS service and every custom application in your environment.

Metrics

A metric is a time-ordered set of data points identified by a namespace, a metric name, and optional dimensions. AWS services publish metrics to CloudWatch automatically at no charge. The publishing interval depends on the service; EC2 basic monitoring, for example, reports every 5 minutes.

Namespace: Groups related metrics by service. AWS/EC2, AWS/RDS, AWS/Lambda, AWS/ApplicationELB. Custom metrics from your applications use your own namespace.

Dimensions: Key-value pairs that identify a specific resource within a namespace. InstanceId=i-0abc1234 narrows CPUUtilization from all EC2 instances to one specific instance.

Metric Source | Resolution | Notes
EC2 basic monitoring | 5 minutes | Free; CPU, network, disk I/O
EC2 detailed monitoring | 1 minute | Paid; same metrics, finer granularity
CloudWatch Agent (on EC2) | Configurable | RAM, disk space, process counts, custom app metrics; not published by AWS by default
Custom via PutMetricData API | 1 minute (standard) or 1 second (high resolution) | Application-level metrics from any source

RAM utilization and disk space usage are not published by AWS automatically — the CloudWatch Agent must be installed and configured on each instance to collect OS-level metrics.
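
For custom metrics, a data point is described by the same namespace, name, and dimensions vocabulary. The sketch below only builds the request payload, in the shape the PutMetricData API accepts, and does not call AWS. The helper name, the `MyApp/Host` namespace, and the sample values are assumptions for illustration.

```python
def build_metric_payload(namespace, name, value, unit, dimensions,
                         high_resolution=False):
    """Build a PutMetricData-style payload for a single data point."""
    return {
        "Namespace": namespace,
        "MetricData": [
            {
                "MetricName": name,
                "Dimensions": [
                    {"Name": key, "Value": val}
                    for key, val in sorted(dimensions.items())
                ],
                "Value": float(value),
                "Unit": unit,
                # 1 = high resolution (1 second), 60 = standard (1 minute)
                "StorageResolution": 1 if high_resolution else 60,
            }
        ],
    }


payload = build_metric_payload(
    namespace="MyApp/Host",           # hypothetical custom namespace
    name="MemoryUtilization",
    value=71.5,
    unit="Percent",
    dimensions={"InstanceId": "i-0abc1234"},
)
print(payload)
```

With an SDK such as boto3, a dictionary like this would typically be passed to the CloudWatch client's `put_metric_data` call as keyword arguments.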

CloudWatch Alarms

An alarm monitors a single metric (or metric math expression) and transitions between three states:

State | Meaning
OK | Metric is within the defined threshold
ALARM | Metric has breached the threshold for the required evaluation periods
INSUFFICIENT_DATA | Not enough data points to evaluate; often seen at startup

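Alarm actions trigger on state transitions: an alarm can publish to an SNS topic, run an EC2 action (stop, terminate, reboot, or recover an instance), or invoke an Auto Scaling policy.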

Composite Alarms: Combine multiple alarms with Boolean logic (AND, OR). Instead of alerting on CPU > 80% alone (which might be a false alarm on a healthy batch job), combine CPU > 80% AND memory > 85% AND disk > 90% → alert only when all three conditions are true simultaneously. Reduces alert noise.
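
The three alarm states and the AND combination can be modeled in a few lines. This is a simplified sketch: it requires every one of the last N data points to breach the threshold and ignores options real alarms have, such as "M out of N" evaluation and missing-data treatment. The function names and sample values are made up.

```python
def alarm_state(datapoints, threshold, evaluation_periods):
    """Evaluate a greater-than-threshold alarm over the most recent periods."""
    recent = datapoints[-evaluation_periods:]
    if len(recent) < evaluation_periods:
        return "INSUFFICIENT_DATA"
    return "ALARM" if all(point > threshold for point in recent) else "OK"


def composite_and(*states):
    """Composite alarm that fires only when every child alarm is in ALARM."""
    return "ALARM" if all(state == "ALARM" for state in states) else "OK"


cpu = alarm_state([70, 85, 90, 88], threshold=80, evaluation_periods=3)
memory = alarm_state([60, 70, 90], threshold=85, evaluation_periods=3)
disk = alarm_state([95], threshold=90, evaluation_periods=3)

print(cpu, memory, disk)
print(composite_and(cpu, memory))
```

Here CPU alone is in ALARM, but the composite stays OK because memory has not breached for all three periods, which is the noise reduction the paragraph above describes.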

CloudWatch Logs

CloudWatch Logs is the central log aggregation service for AWS. Services, applications, and the CloudWatch Agent stream log records to CloudWatch Logs.

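Log structure: log events (a timestamp plus a message) are grouped into log streams, one per source such as an instance or container; log streams belong to a log group, which carries the retention and access settings.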

Metric Filters: Scan incoming log records for a pattern and increment a custom CloudWatch metric when matched. Example: filter for [ERROR] in an application log → create the metric ApplicationErrorRate → alarm when it exceeds 50 per minute. This bridges unstructured logs into the metrics alerting system.
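
Conceptually, a metric filter turns matching log lines into a count per time bucket that an alarm can then evaluate. The sketch below imitates that with a plain substring match; CloudWatch's own filter pattern syntax is richer, and the function names, timestamps, and messages here are invented for the example.

```python
from collections import Counter


def count_matches_per_minute(log_events, term="[ERROR]"):
    """Count log events containing `term`, bucketed by minute."""
    counts = Counter()
    for timestamp_ms, message in log_events:
        if term in message:
            counts[timestamp_ms // 60_000] += 1
    return counts


def minutes_in_alarm(counts, threshold=50):
    """Return the minute buckets whose match count exceeds the threshold."""
    return sorted(minute for minute, count in counts.items() if count > threshold)


events = (
    [(i * 1000, "[ERROR] payment failed") for i in range(51)]   # minute 0
    + [(60_000 + i, "[ERROR] timeout") for i in range(3)]       # minute 1
    + [(61_000, "[INFO] request ok")]
)
counts = count_matches_per_minute(events)
print(dict(counts))
print(minutes_in_alarm(counts))
```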

CloudWatch Logs Insights: Interactive SQL-like query language for ad-hoc log analysis:

fields @timestamp, @message
| filter @message like /ERROR/
| stats count(*) as errorCount by bin(5m)
| sort errorCount desc
| limit 200

Query across multiple log groups simultaneously. Visualize time series, distribution histograms, or tabular results. Pin queries to dashboards.

Log subscriptions: Stream log data to Kinesis Data Streams, Kinesis Data Firehose, or Lambda in real time. Use cases: forward logs to a third-party SIEM, transform and load logs to OpenSearch, archive logs to S3 with Firehose.

Additional CloudWatch Features

Anomaly Detection: CloudWatch trains an ML model on a metric’s historical behavior and creates a band of statistically expected values. Alarms fire when the metric falls outside the band without requiring manual threshold configuration. Automatically accounts for daily and weekly seasonality.
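
The notion of an expected band can be illustrated with basic statistics. The sketch below is a stand-in, not CloudWatch's model: it uses a fixed mean plus or minus a multiple of the standard deviation and has no notion of seasonality. The function names, the `width` parameter, and the sample history are assumptions.

```python
from statistics import mean, stdev


def expected_band(history, width=2.0):
    """Return (low, high) bounds around the historical mean."""
    center = mean(history)
    spread = stdev(history)
    return center - width * spread, center + width * spread


def is_anomalous(value, history, width=2.0):
    """True when the value falls outside the expected band."""
    low, high = expected_band(history, width)
    return value < low or value > high


history = [100, 102, 98, 101, 99, 100]
print(expected_band(history))
print(is_anomalous(101, history), is_anomalous(110, history))
```

The point of the band is that no one has to pick the number 110 as a threshold in advance; the bounds follow from the metric's own history.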

Composite Alarms with Anomaly Detection: Combine anomaly-based alarms with threshold alarms for more intelligent alerting.

CloudWatch Dashboards: Cross-account, cross-region dashboards that visualize any combination of metrics, log insights results, and alarm status in one view. Shared dashboards for NOC displays.

Container Insights: Collect metrics and logs from ECS clusters and EKS clusters at the cluster, service, task, and container level. Standard EC2 metrics only report at the instance level — Container Insights provides task-level and container-level granularity.

Synthetics: Scheduled canary scripts (Node.js or Python) that simulate user interactions with your application — loading a page, submitting a form, calling an API. Measures availability and latency from outside your application’s own instrumentation. Alerts when canaries fail or exceed latency thresholds.

Evidently: Feature flag management and A/B testing integrated with CloudWatch metrics. Gradually roll out features to a percentage of users, measure the metric impact, and roll back automatically if metrics degrade.


Amazon EventBridge for Governance Automation

EventBridge is the AWS event bus that connects AWS services, applications, and SaaS products in near real-time. For governance, EventBridge enables automated response to control plane events:

Source Event | EventBridge Rule | Target Action
Config rule non-compliance detected | Rule: source = aws.config, detail.newEvaluationResult.complianceType = NON_COMPLIANT | Lambda: remediate the resource
GuardDuty finding (high severity) | Rule: source = aws.guardduty, detail.severity >= 7 | Step Functions: isolate instance, snapshot, notify security team
CloudTrail root account login | Rule: source = aws.signin, detail.userIdentity.type = Root | SNS: immediate alert to security on-call
EC2 instance launched without required tagsRule | Rule: source = aws.ec2, detail.eventName = RunInstances | Lambda: tag check, stop instance if tag policy violated

EventBridge delivers matching events to rule targets with low latency, typically within seconds of the event arriving on the bus. That makes it the closest thing to real-time governance response that AWS provides without custom infrastructure.
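
The first two rules in the table can be written as EventBridge event patterns, shown here as Python dictionaries together with a toy matcher. The patterns use EventBridge's pattern conventions (lists of allowed values, and a `numeric` condition for the severity check). The `matches` function is a hypothetical, heavily reduced imitation of the service's matching that supports only exact values and `>=`.

```python
CONFIG_NON_COMPLIANT = {
    "source": ["aws.config"],
    "detail": {"newEvaluationResult": {"complianceType": ["NON_COMPLIANT"]}},
}

GUARDDUTY_HIGH_SEVERITY = {
    "source": ["aws.guardduty"],
    "detail": {"severity": [{"numeric": [">=", 7]}]},
}


def matches(pattern, event):
    """Return True when every field in the pattern matches the event."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            # Nested object: recurse into the corresponding part of the event.
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif not any(_value_matches(candidate, actual) for candidate in expected):
            return False
    return True


def _value_matches(candidate, actual):
    """Match one allowed value; only exact values and numeric >= are handled."""
    if isinstance(candidate, dict) and "numeric" in candidate:
        operator, bound = candidate["numeric"]
        return (operator == ">="
                and isinstance(actual, (int, float))
                and actual >= bound)
    return candidate == actual


finding = {"source": "aws.guardduty", "detail": {"severity": 8}}
print(matches(GUARDDUTY_HIGH_SEVERITY, finding))
```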


Governance Flow Example

  1. Developer → AWS APIs: ec2:TerminateInstances on prod account
  2. SCP on Production OU: Deny ec2:TerminateInstances → AccessDenied, request blocked (SCP evaluation: explicit Deny overrides any IAM Allow)
  3. CloudTrail: denied API call recorded (event: errorCode=AccessDenied, eventName=TerminateInstances)
  4. CloudWatch Logs: event streamed to log group /aws/cloudtrail/organization-trail
  5. Metric filter: AccessDenied spike detected (filter pattern: AccessDenied → metric FailedAPICallCount +1)
  6. Alarm → SNS notification: FailedAPICallCount > 5 in 5 min → email to the topic's subscribers
