The first week with a new client is mostly listening. Show us the AWS console. Show us your repos. Show us the on-call rotation. Show us last quarter's incidents. By Friday we deliver one page — what's red, what's amber, what's green, and what we'd touch first if it were our system.
The rubric below is what we score against. None of it is novel. The point is having a single artifact you can revisit every six months and watch greens accumulate.
The five dimensions
1. Security posture
2. Cost & waste
3. IaC coverage
4. Observability & on-call
5. Resilience & recovery
For each, we score Red / Amber / Green against a small number of binary checks. Anything Red goes on page one of the report. Anything Amber gets a one-line note. Greens get a tick and we move on.
1. Security posture
Green if all of:
- Multi-account AWS Organization (or equivalent on GCP/Azure). No human has root keys for the management account.
- SSO via Identity Center / Workspace / Entra. Zero IAM users with long-lived credentials.
- S3 / GCS / Blob: account-level public access blocked. No bucket exposes objects publicly without a documented reason.
- CloudTrail (or equivalent) enabled in every account, logs in a separate account, immutable retention.
- Secrets in a manager (Secrets Manager, Vault, Doppler). No secrets committed to repos, whether in .env files or hard-coded config.
- Production database not directly reachable from public internet.
Red on any of: long-lived IAM access keys belonging to humans, public S3 buckets that nobody can explain, root account in active use, secrets in .env files in the repo.
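Most of these take one command to verify. A minimal sketch of two checks with the AWS CLI, assuming credentials for the account under audit; the account ID is a placeholder:

```bash
#!/usr/bin/env bash
# Any IAM user with an active access key is a candidate Red finding.
for user in $(aws iam list-users --query 'Users[].UserName' --output text); do
  keys=$(aws iam list-access-keys --user-name "$user" \
    --query 'AccessKeyMetadata[?Status==`Active`].AccessKeyId' --output text)
  [ -n "$keys" ] && echo "RED: $user has active long-lived keys: $keys"
done

# Account-level S3 public access block: all four settings should be true.
aws s3control get-public-access-block --account-id 123456789012
```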
2. Cost & waste
Green if all of:
- Monthly bill is reviewed by at least one person who could cut it.
- Cost Explorer shows no single line item > 30% of the bill that nobody can explain.
- NAT Gateway, Data Transfer, and "Other" combined < 25% of bill.
- Reserved Instances / Savings Plans cover steady-state compute, or there's a documented reason they don't.
- Tags exist for `environment`, `service`, and `owner`, and at least 80% of cost is attributable.
- Non-prod environments have an off-hours shutdown story (even if it's "we use spot").
Red on: nobody owns the bill, top line is "EC2 — Other", non-prod runs 24/7 at full size, no tagging strategy at all.
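The 30% rule takes one Cost Explorer API call to test. A sketch; the dates are illustrative, and UnblendedCost is one reasonable metric choice:

```bash
# Top ten services on last month's bill, largest first.
aws ce get-cost-and-usage \
  --time-period Start=2025-01-01,End=2025-02-01 \
  --granularity MONTHLY \
  --metrics UnblendedCost \
  --group-by Type=DIMENSION,Key=SERVICE \
  --query 'ResultsByTime[0].Groups[].[Keys[0],Metrics.UnblendedCost.Amount]' \
  --output text | sort -t$'\t' -k2 -rn | head -10
```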
3. IaC coverage
Green if all of:
- Production infrastructure is > 90% in Terraform / Pulumi / CloudFormation. Click-ops is the documented exception, not the norm.
- State is in a remote backend with locking. Nobody runs `terraform apply` from a laptop into prod.
- State is split into layers by lifecycle (see our earlier post). Not a single 4,000-resource state file.
- CI runs `plan` on every PR; merges are gated on a clean plan (sketch below).
- You can rebuild a whole environment from scratch in < 4 hours from code alone.
Red on: a senior engineer is the only person who can deploy because half the resources are click-ops; state is in a single S3 bucket with no locking; you cannot answer "could we rebuild this from scratch?" with yes.
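The plan gate from the checklist is a few lines in any CI system. A sketch assuming Terraform; run it without `set -e`, since the exit codes carry meaning:

```bash
# -detailed-exitcode: 0 = no changes, 2 = changes pending, 1 = error.
terraform init -input=false
terraform plan -input=false -detailed-exitcode -out=tfplan
case $? in
  0) echo "Clean plan, safe to merge" ;;
  2) echo "Plan has changes: attach tfplan to the PR for review" ;;
  *) echo "Plan failed"; exit 1 ;;
esac
```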
4. Observability & on-call
Green if all of:
- Logs from every production service land in one searchable place (CloudWatch + Athena, Datadog, Grafana Loki — pick one).
- Metrics for the four golden signals (latency, traffic, errors, saturation) per service.
- Alerts fire on user-visible problems, not on internal CPU thresholds. False-positive rate < 1 per week per on-call.
- Documented on-call rotation with paging, escalation, and a runbook per alert.
- At least one human can be paged at any hour. Phone, not Slack.
- Dashboards exist that the CEO could open and understand.
Red on: nobody is paged out of hours, alerts go to an unread Slack channel, "monitoring" is reading `kubectl logs` when something breaks.
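For "alerts fire on user-visible problems", the shape of the alarm matters more than the tool. A CloudWatch sketch; the service name, threshold, and SNS topic are placeholders:

```bash
# Page on the 5xx rate users actually see at the load balancer,
# not on an internal CPU threshold.
aws cloudwatch put-metric-alarm \
  --alarm-name checkout-5xx \
  --namespace AWS/ApplicationELB \
  --metric-name HTTPCode_Target_5XX_Count \
  --dimensions Name=LoadBalancer,Value=app/checkout/0123456789abcdef \
  --statistic Sum --period 300 --evaluation-periods 2 \
  --threshold 25 --comparison-operator GreaterThanThreshold \
  --treat-missing-data notBreaching \
  --alarm-actions arn:aws:sns:eu-west-1:123456789012:pagerduty-ingest
```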
5. Resilience & recovery
Green if all of:
- Documented RPO/RTO per critical service. Numbers, not adjectives.
- Automated backups for every stateful service. Restore tested in the last 6 months.
- Production runs in >1 AZ for stateless workloads. Stateful workloads have a documented failover plan.
- Last incident has a written postmortem with at least one merged action item.
- You can answer "what happens if AZ-A goes down right now?" without saying "we'd find out."
Red on: backups exist but have never been restored, no postmortems, single-AZ production, "DR plan" lives in someone's head.
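If the stateful tier is RDS, three of these checks fall out of one CLI call. A sketch; the attribute names are real RDS fields, the pass thresholds are the rubric's:

```bash
# Per database: multi-AZ? backups retained? publicly reachable?
# BackupRetentionPeriod of 0 means automated backups are off entirely.
aws rds describe-db-instances \
  --query 'DBInstances[].{db:DBInstanceIdentifier,multiAZ:MultiAZ,backupDays:BackupRetentionPeriod,public:PubliclyAccessible}' \
  --output table
```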
The deliverable
One page. Five rows. Each row: Red / Amber / Green, two-line finding, one-line first action. Then a prioritised top-five list of what to fix in the next 90 days, with effort estimates.
That's the entire artifact. Not 40 pages. Not a slide deck. One page that the CTO can forward to the CEO and the board.
The 90-day plan
Whatever the audit surfaces, the follow-up plan never has more than five items. Brain capacity is finite. The two universal rules:
- Fix one Red before chasing two Ambers. Reds are the things that turn into incidents.
- Cost wins fund security work. If the audit finds €4k/month of waste, the savings pay for the next two months of remediation work.
Self-running it
You can run this audit yourself today. It takes about a day with two senior engineers and access to the cloud consoles. The hardest part is being honest about the Reds — if you wrote the system being audited, you'll instinctively defend it. Ask someone outside the team to score it, even if they don't know your stack. The questions are mostly binary.
If you do, send us the result. We'll tell you which finding we'd prioritise — for free, no sales call attached.
If running it yourself isn't realistic, we do this audit fixed-scope, fixed-fee in 1 week. Drop us a note.