GCP — Reliability and Cost Optimisation


Designing reliable GCP architectures — SRE principles, high availability patterns, DR planning, and cost optimisation strategies.

Tags: gcp, google-cloud, reliability, sre, high-availability, cost-optimisation, disaster-recovery

Overview

Reliability engineering on GCP operates at two levels. The first is architectural: which design choices make a system resilient to the failure modes that GCP’s infrastructure actually exhibits — zone outages, regional degradation, service quota exhaustion, and dependency failures. The second is operational: how do teams measure, budget, and continuously improve reliability after deployment?

Google pioneered Site Reliability Engineering (SRE) as a discipline and published its practices openly. The GCP platform is built with these practices in mind: its services offer granular SLAs, its monitoring stack surfaces the SLIs needed to track SLOs, and its cost tooling provides the visibility required to balance reliability investment against its price. This article covers both the architectural patterns and the operational practices, plus the cost optimisation levers that allow teams to run reliable systems without overspending.


SRE Principles

SLI, SLO, SLA, and Error Budget

These four terms define the reliability contract for any service:

| Term | Definition | Example |
| --- | --- | --- |
| SLI (Service Level Indicator) | A quantitative measure of service behaviour | Percentage of HTTP requests completing in < 200ms |
| SLO (Service Level Objective) | The internal target value for an SLI | 99.9% of requests complete in < 200ms per 30-day window |
| SLA (Service Level Agreement) | The contractual commitment to customers; breach triggers penalties | 99.5% availability; below this, credits are issued |
| Error Budget | 100% − SLO target; the allowable unreliability per period | 0.1% of requests per 30 days may fail or be slow |

The SLO is always stricter than the SLA. The gap between them is the safety margin — if the SLO is 99.9% but the SLA is 99.5%, the 0.4-percentage-point buffer means an occasional SLO miss does not immediately become an SLA breach.

Error budgets are the key operational mechanism. Every deployment, configuration change, or infrastructure modification consumes error budget (it introduces risk of failure). When the error budget is exhausted before the end of the measurement window, the SRE practice is to freeze new releases and focus exclusively on reliability work until the budget resets. This creates a natural forcing function for developers and reliability engineers to collaborate: developers want to ship features, but they also want to preserve error budget so they can keep shipping.

Overload Handling

When a system receives more traffic than it can handle, three strategies exist:

- Autoscaling: add capacity before saturation, within quota and budget limits.
- Load shedding: reject excess requests early with a cheap error (e.g. HTTP 429) rather than letting queues grow without bound.
- Graceful degradation: serve a cheaper, reduced-quality response (cached results, fewer features) so core functionality survives the spike.

Cascading Failure Prevention

One of the most dangerous reliability failure modes is a cascading failure: Service A depends on Service B, B becomes slow or unavailable, A’s threads block waiting for B, A runs out of thread pool capacity, and A also becomes unavailable — propagating the outage upstream. Prevention mechanisms include:

- Timeouts on every remote call, so threads are never blocked indefinitely.
- Bounded retries with exponential backoff and jitter, so retry storms do not amplify the overload.
- Circuit breakers that fail fast once a dependency is clearly unhealthy.
- Bulkheads: separate thread or connection pools per dependency, so one slow dependency cannot exhaust shared capacity.
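A circuit breaker is one common defence against the thread-exhaustion scenario above: after repeated failures calling a dependency, subsequent calls fail fast to a fallback instead of blocking. This is a minimal sketch; the thresholds, injectable clock, and fallback are illustrative:

```python
# Minimal circuit breaker: after N consecutive failures, fail fast for a
# cool-down period, then allow one probe through ("half-open").
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                return fallback()          # open: fail fast, no thread blocks
            self.opened_at = None          # half-open: let one attempt through
            self.failures = 0
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # trip the breaker
            return fallback()
        self.failures = 0
        return result

t = [0.0]  # fake clock for a deterministic demo
cb = CircuitBreaker(failure_threshold=2, reset_timeout=30.0, clock=lambda: t[0])
def flaky(): raise RuntimeError("dependency down")

cb.call(flaky, lambda: "cached")                   # failure 1
cb.call(flaky, lambda: "cached")                   # failure 2 -> breaker opens
print(cb.call(lambda: "live", lambda: "cached"))   # cached: failing fast
t[0] = 31.0
print(cb.call(lambda: "live", lambda: "cached"))   # live: probe succeeds, closes
```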


Designing for Failure

Redundancy and Single Points of Failure

The first rule of reliable architecture is to have no single point of failure. On GCP, redundancy has three main dimensions:

Zonal redundancy deploys resources across multiple availability zones within a single region. A GCP region has at least three zones, each an independent failure domain with separate power, cooling, and networking. Zone failures happen rarely but do occur. Running workloads across multiple zones ensures that a zone failure takes out only a fraction of capacity rather than the entire service.

Regional redundancy deploys resources across two or more GCP regions. This protects against region-level events (severe weather, major network incidents, or the unlikely but real possibility of a full-region outage). Regional redundancy adds latency and cost but is required for very high availability tiers.

Service redundancy means not relying on a single instance of a dependency. Every critical database should have a replica. Every external API call should have a fallback. Every cache should have a read-through path to the backing store.
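The value of redundancy can be quantified with back-of-envelope availability math, assuming independent failure domains (an idealisation — correlated failures reduce the benefit). The 99.9% single-zone figure below is a hypothetical input:

```python
# Redundant replicas in parallel multiply nines; hard dependencies in
# series divide them.

def parallel(avail: float, n: int) -> float:
    """Availability of n redundant replicas: up if at least one is up."""
    return 1 - (1 - avail) ** n

def serial(*avails: float) -> float:
    """Availability of a chain of hard dependencies: all must be up."""
    result = 1.0
    for a in avails:
        result *= a
    return result

zone = 0.999  # hypothetical availability of a single-zone deployment
print(f"1 zone:          {zone:.9f}")
print(f"3 zones:         {parallel(zone, 3):.9f}")        # 1 - 0.001^3
print(f"chain of 3 deps: {serial(zone, zone, zone):.9f}")  # 0.999^3
```

This is why removing single points of failure (parallel paths) and shortening critical dependency chains (fewer serial terms) are the two levers that matter most.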

GCP HA Patterns

Managed Instance Groups (MIGs) are the foundation of compute redundancy on GCP. A regional MIG distributes VM instances across all zones in a region. If one zone fails, the MIG’s autoscaler detects the capacity loss and launches replacement instances in the remaining zones. Regional MIGs should always be preferred over zonal MIGs for production workloads.

Global External HTTP(S) Load Balancer distributes traffic to backends in multiple regions using a single anycast IP address. Users are routed to the nearest healthy backend. If a region’s backends fail their health checks, traffic is automatically redirected to the next nearest region. This is GCP’s flagship HA pattern for internet-facing web applications.

Cloud SQL HA provisions a standby replica in a different zone within the same region. Replication is synchronous — writes are not acknowledged until they land on both primary and standby. If the primary zone fails, Cloud SQL automatically promotes the standby replica. Failover typically completes in under 60 seconds, and the connection string remains the same.

Cloud Storage multi-region buckets store data across at least two geographically separated GCP regions within a continent (e.g., US multi-region spans multiple US regions). Durability is 11 nines (99.999999999%). Data is accessible even if an entire region becomes unavailable.

| Pattern | Protects Against | Cost Impact |
| --- | --- | --- |
| Regional MIG (multi-zone) | Zone failure | Minimal (spread across zones in one region) |
| Global HTTP(S) LB | Region failure | Moderate (cross-region egress costs) |
| Cloud SQL HA | Zone failure | 2x instance cost |
| Cloud SQL cross-region replica | Region failure | Additional replica + egress costs |
| Cloud Spanner multi-region | Region failure | Higher per-node cost |
| Cloud Storage multi-region | Region failure | Slightly higher storage per GB |

Disaster Recovery

RPO and RTO

Every DR plan is defined by two objectives:

- Recovery Time Objective (RTO): the maximum acceptable time between the onset of an outage and restoration of service.
- Recovery Point Objective (RPO): the maximum acceptable window of data loss, measured backwards from the moment of failure.

A system with an RTO of 1 hour and an RPO of 15 minutes must be able to recover to operation within 60 minutes of an outage, with no more than 15 minutes of transaction data missing. Meeting tighter objectives requires more infrastructure that must remain always-on or near-ready, which increases cost.
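A backup schedule can be checked against these objectives mechanically. The durations below are illustrative; in practice the restore time should come from measured restore drills, not estimates:

```python
# Check a backup schedule against RPO/RTO objectives (all values in minutes).

def worst_case_rpo_minutes(backup_interval_minutes: float) -> float:
    # Data written just after a backup completes is lost if the failure
    # hits just before the next one: worst case equals the full interval.
    return backup_interval_minutes

def meets_objectives(backup_interval_min: float, measured_restore_min: float,
                     rpo_min: float, rto_min: float) -> bool:
    return (worst_case_rpo_minutes(backup_interval_min) <= rpo_min
            and measured_restore_min <= rto_min)

# RTO 60 min / RPO 15 min, as in the example above:
print(meets_objectives(backup_interval_min=15, measured_restore_min=45,
                       rpo_min=15, rto_min=60))   # backups every 15 min: OK
print(meets_objectives(backup_interval_min=60, measured_restore_min=45,
                       rpo_min=15, rto_min=60))   # hourly backups: RPO missed
```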

DR Strategies

GCP architectures typically map to four DR strategy tiers:

| Strategy | RTO | RPO | Cost | Description |
| --- | --- | --- | --- | --- |
| Backup and Restore | Hours to days | Hours | Low | Regular backups to Cloud Storage; restore from scratch on failure |
| Pilot Light | Hours | Minutes | Medium | Core infrastructure (DNS, load balancer config, small DB replica) running; expand on failure |
| Warm Standby | Minutes | Seconds | High | Scaled-down duplicate environment always running; scale up on failure |
| Multi-site Active/Active | Near zero | Near zero | Very high | Full duplicate running in a second region; instant failover with live traffic split |

The backup-and-restore strategy is appropriate for development environments, non-critical workloads, and systems where hours of downtime are acceptable. The multi-site active/active strategy is appropriate for globally critical services where even seconds of downtime translate to significant financial or reputational harm.

Pilot light is the most common pattern for mid-tier production workloads. The DR region has a pre-configured VPC, a running Cloud SQL read replica (which can be promoted to primary), and a dormant MIG instance template ready to scale out. When disaster strikes, promotion of the Cloud SQL replica and scaling the MIG can restore service in 30–60 minutes.


Cost Optimisation

Reliability costs money — redundant infrastructure, cross-region replication, and always-on standby all carry price tags. Cost optimisation is not about cutting corners on reliability; it is about avoiding waste so that investment can be directed toward the reliability measures that actually matter.

Commitment Discounts

GCP offers two types of discount for Compute Engine and Cloud SQL:

Sustained Use Discounts (SUDs) are automatic. When a VM runs for more than 25% of a calendar month, GCP begins applying a discount that grows linearly up to 30% off on-demand pricing at 100% monthly usage. No action or commitment is required — the discount appears automatically on the billing invoice.

Committed Use Discounts (CUDs) require a 1-year or 3-year purchase commitment for a specific resource type in a specific region. In exchange, GCP provides up to 57% off on-demand pricing for a 3-year commitment. CUDs apply at the project level. Critically, if you delete the resource before the commitment period ends, you still pay for it — the commitment is for the resource type and quantity, not for a specific VM instance.
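Because a CUD bills for the full term whether or not the resource runs, the decision reduces to a break-even utilisation calculation. The sketch below uses the discount levels from the table in this article and deliberately ignores SUDs, which would raise the break-even point somewhat:

```python
# Break-even utilisation for a committed use discount.
# On-demand cost over the term: rate * hours * utilisation.
# CUD cost over the term:       rate * hours * (1 - discount), paid regardless.
# The two cross where utilisation == 1 - discount.

def cud_breakeven_utilisation(discount: float) -> float:
    return 1 - discount

print(f"1-year CUD (37% off): worth it above "
      f"{cud_breakeven_utilisation(0.37):.0%} utilisation")
print(f"3-year CUD (57% off): worth it above "
      f"{cud_breakeven_utilisation(0.57):.0%} utilisation")
# Note: sustained use discounts on the on-demand side (up to 30%) push the
# real break-even higher than this simple model suggests.
```

In other words, a machine that is needed less than roughly 63% of the time is usually cheaper on demand than under a 1-year commitment.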

| Discount Type | Action Required | Discount Level | Flexibility |
| --- | --- | --- | --- |
| Sustained Use | None (automatic) | Up to 30% | Full — no commitment |
| Committed Use (1-year) | Purchase commitment | Up to 37% | Low — must use for 1 year |
| Committed Use (3-year) | Purchase commitment | Up to 57% | Very low — must use for 3 years |
| Preemptible / Spot VMs | Use fault-tolerant workloads | Up to 80–90% | Medium — workload must tolerate preemption |

Preemptible and Spot VMs

For fault-tolerant batch workloads — data processing, rendering, ML training, load testing — Spot VMs (the successor to Preemptible VMs) provide up to 90% cost savings. GCP can reclaim a Spot VM with 30 seconds' notice, so applications must checkpoint progress regularly and handle preemption gracefully. Spot VMs are inappropriate for databases, stateful services, or user-facing workloads with strict latency requirements.
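The checkpoint-and-resume discipline that Spot workloads require can be sketched as follows. The in-memory `store` dict stands in for a durable checkpoint location such as a Cloud Storage object, and the `preempt_after` parameter simulates a reclaim mid-run:

```python
# Preemption-tolerant batch worker: checkpoint progress so a reclaimed Spot
# VM's work can resume on a replacement instance from the last checkpoint.

def run_batch(items, store, checkpoint_every=100, preempt_after=None):
    """Process items, resuming from store['done']; returns count processed."""
    start = store.get("done", 0)  # resume point after a previous preemption
    processed = 0
    for i in range(start, len(items)):
        # ... real work on items[i] would happen here ...
        processed += 1
        if (i + 1) % checkpoint_every == 0:
            store["done"] = i + 1  # durable checkpoint
        if preempt_after is not None and processed >= preempt_after:
            return processed       # simulate the VM being reclaimed
    store["done"] = len(items)
    return processed

store = {}
items = list(range(250))
print(run_batch(items, store, checkpoint_every=100, preempt_after=150))
# first run: 150 items processed, but only 100 durably checkpointed
print(run_batch(items, store, checkpoint_every=100))
# replacement VM resumes at item 100 and finishes the remaining 150
```

The cost of preemption is bounded by the checkpoint interval: at most `checkpoint_every` items are reprocessed per reclaim.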

The recommended pattern for Dataproc batch jobs is to keep the minimum of two primary (non-Spot) workers — which host HDFS and provide stable YARN capacity — and run the bulk of the workers as Spot secondary workers, cutting cluster costs by 60–80% while retaining fault tolerance through Spark’s built-in task retry logic.
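The savings from such a mixed cluster are straightforward arithmetic. The hourly rate and the 70% Spot discount below are illustrative assumptions; actual Spot pricing varies by machine type and region:

```python
# Rough hourly cost of a cluster mixing standard and Spot workers.

def cluster_hourly_cost(n_standard, n_spot, rate, spot_discount=0.70):
    return n_standard * rate + n_spot * rate * (1 - spot_discount)

all_standard = cluster_hourly_cost(10, 0, rate=0.20)   # 10 standard workers
mixed = cluster_hourly_cost(2, 8, rate=0.20)           # 2 standard + 8 Spot
print(f"all standard: ${all_standard:.2f}/h, 2+8 Spot: ${mixed:.2f}/h "
      f"({1 - mixed / all_standard:.0%} saved)")
```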

Rightsizing

Over-provisioned VMs are one of the most common sources of cloud waste. A VM running at 10% average CPU utilisation is likely a candidate for rightsizing to a smaller machine type. GCP provides two tools for this:

GCP Recommender analyses Compute Engine VM metrics over a rolling 30-day window and produces rightsizing recommendations. Each recommendation includes projected savings, the suggested machine type, and a confidence level based on the consistency of the usage pattern. Recommendations are available in the console, via the `gcloud recommender recommendations list` command, and through the Recommender API for programmatic integration.
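The intuition behind rightsizing can be illustrated with a toy heuristic: flag machines whose 30-day average utilisation sits well below capacity and suggest the next size down. The threshold and the size ladder are assumptions for illustration, not Recommender's actual model:

```python
# Illustrative rightsizing heuristic: suggest one size down when both CPU
# and memory utilisation are consistently low. e2-standard machine types
# double in size at each step.

MACHINE_SIZES = ["e2-standard-2", "e2-standard-4", "e2-standard-8"]  # ascending

def rightsizing_suggestion(machine, avg_cpu, avg_mem, threshold=0.25):
    if avg_cpu >= threshold or avg_mem >= threshold:
        return None  # utilisation too high to safely downsize
    idx = MACHINE_SIZES.index(machine)
    if idx == 0:
        return None  # already the smallest tier modelled here
    return MACHINE_SIZES[idx - 1]

# The 10%-CPU VM from the paragraph above:
print(rightsizing_suggestion("e2-standard-8", avg_cpu=0.10, avg_mem=0.15))
print(rightsizing_suggestion("e2-standard-8", avg_cpu=0.60, avg_mem=0.15))
```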

GKE Autopilot sidesteps the rightsizing problem for container workloads. Instead of billing per node VM, Autopilot bills per pod resource request (CPU and memory). GKE manages node provisioning automatically, so you never pay for idle node capacity — only for what your pods actually request.

Idle Resource Cleanup

Resources that are stopped but still exist in GCP continue to incur storage charges (persistent disks) and sometimes IP address charges (reserved static external IPs with no attached resource). Cloud Monitoring and Active Assist surface idle resource recommendations:

- Idle VMs with sustained near-zero CPU and network activity.
- Unattached persistent disks, still billed for their provisioned capacity.
- Unused reserved static external IP addresses, which are charged when not attached to a running resource.
- Idle Cloud SQL instances with no connections over an extended window.

Budget Alerts and Billing Export

Cloud Billing Budgets allow you to define a spending threshold for a project, folder, or billing account. When actual or forecasted spend reaches a configured percentage (commonly 50%, 75%, 90%, 100%), an alert is sent via email or Pub/Sub. Pub/Sub integration enables automated responses — for example, a Cloud Function that disables certain APIs or creates a ticket in a project management system when the budget threshold is exceeded.
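The decision logic inside such a Pub/Sub subscriber can be sketched as a pure function over the budget notification payload. The field names (`costAmount`, `budgetAmount`, `budgetDisplayName`) follow the Cloud Billing budget notification format; the actions returned on each threshold are illustrative:

```python
# Sketch of the decision logic behind a budget-alert Pub/Sub subscriber.
import json

def handle_budget_message(data: bytes) -> str:
    note = json.loads(data)
    cost = note["costAmount"]
    budget = note["budgetAmount"]
    if cost >= budget:
        return "page-oncall"   # 100%+: trigger runbook automation / ticket
    if cost >= 0.9 * budget:
        return "notify-team"   # 90%: warn before the budget is blown
    return "ok"

msg = json.dumps({"budgetDisplayName": "prod-monthly",
                  "costAmount": 950.0,
                  "budgetAmount": 1000.0}).encode()
print(handle_budget_message(msg))  # 95% of budget -> notify-team
```

In a real deployment this function would be the body of a Cloud Function subscribed to the budget's Pub/Sub topic, with the actions wired to actual side effects.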

Billing export to BigQuery streams all billing data (line-item charges, credits, resource labels) into a BigQuery dataset. This enables SQL queries across the full cost history: which team’s workloads are most expensive, which labels account for what spend, and which services are trending up faster than expected. Cost allocation by label allows chargebacks to business units without requiring separate GCP projects per team.
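A per-label chargeback query against the export table might look like the sketch below. The project, dataset, and table names are placeholders; the `labels` and `credits` repeated fields follow the standard billing export schema:

```python
# Illustrative BigQuery SQL over the billing export: net cost for the last
# 30 days, grouped by a "team" resource label for chargeback.
QUERY = """
SELECT
  (SELECT value FROM UNNEST(labels) WHERE key = 'team') AS team,
  SUM(cost) + SUM(IFNULL((SELECT SUM(c.amount) FROM UNNEST(credits) c), 0))
    AS net_cost
FROM `my-project.billing_export.gcp_billing_export_v1_XXXXXX`
WHERE usage_start_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY team
ORDER BY net_cost DESC
"""
print(QUERY)
```

Adding credits to raw cost matters: commitment and sustained-use discounts arrive as negative credit line items, so summing `cost` alone overstates spend.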

Active Assist

Active Assist is GCP’s umbrella for machine learning-powered operational recommendations. Beyond VM rightsizing, it covers:

All Active Assist recommendations include an estimated monthly savings figure and a one-click apply option where the change can be safely automated.


Reliability and Cost: The Trade-Off

Reliability is not free. The following table illustrates how increasing the availability tier of a simple web application increases both complexity and cost:

| Availability Target | Architecture | Relative Cost |
| --- | --- | --- |
| ~99.9% | Single-zone MIG + zonal Cloud SQL | 1x |
| ~99.99% | Regional MIG + Cloud SQL HA (multi-zone) | ~2x |
| ~99.99% (global) | Global LB + multi-region MIGs + Cloud SQL with cross-region replica | ~3–4x |
| 99.999% | Multi-region active/active + Cloud Spanner | 5x+ |

The right answer depends on the business value of the service. A marketing website and a payment processing engine do not require the same availability tier. Applying SRE discipline — defining explicit SLOs and measuring actual SLIs — provides the data needed to make these trade-off decisions objectively rather than based on intuition.