The classic monolith starts the same way everywhere. Day one: a single `main.tf` with a VPC and an RDS instance. Year two: 4,000 resources in one state, `terraform plan` takes nine minutes, and any change to a security group holds the state lock for everyone.
The instinct is to split. The instinct is right. The way most teams split is wrong.
Why "split by team" fails
The intuitive carve-up looks like the org chart:
```
terraform/
  backend-team/
  frontend-team/
  data-team/
  platform-team/
```
Three things break, in this order:
- Shared resources need an owner. Whose state holds the VPC? The shared S3 bucket? The IAM role used by everyone's CI? Ownership disputes turn into PRs that sit for a week.
- Cross-state references multiply. Backend's RDS subnet group needs subnet IDs from Platform's VPC, so backend reads platform state via `data "terraform_remote_state"` (see the sketch after this list). Now backend's plan breaks every time platform's outputs change.
- Reorgs corrupt the layout. The data team merges into platform. Now the directory structure is a fossil. You either rename (and rewrite every remote-state reference) or live with a lie.
The org chart changes every year. Infrastructure changes on a different cadence. Don't couple them.
Split by lifecycle
Group resources by how often they change, not who owns them. A workable default for most cloud accounts:
```
terraform/
  # Changes ≈ never. Manual approval to plan.
  bootstrap/   # the state bucket itself, IAM roots, org-level SCPs

  # Changes ≈ quarterly. Reviewed carefully.
  network/     # VPCs, subnets, transit gateways, DNS zones
  data/        # RDS, ElastiCache, S3 buckets that hold business data

  # Changes ≈ weekly. Auto-applied on merge.
  platform/    # EKS clusters, ECR repos, shared IAM roles

  # Changes ≈ daily. Auto-applied per app.
  apps/
    api/
    worker/
    web/
```
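Each layer is its own root module with its own key in the shared state bucket. A minimal backend block for the platform layer, as a sketch: the bucket and region match the example further down, and the DynamoDB lock table name is an assumption.

```hcl
# platform/backend.tf
terraform {
  backend "s3" {
    bucket         = "acme-tf-state"
    key            = "platform/terraform.tfstate" # one key per layer
    region         = "eu-south-2"
    dynamodb_table = "acme-tf-locks"              # assumed lock table name
    encrypt        = true
  }
}
```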
Why this works:
- Blast radius matches risk. A bad `apps/api` apply can't accidentally drop your production database — different state, different blast zone.
- Plan times stay sane. No layer holds more than a few hundred resources.
- The dangerous layers move slowly on purpose. When `data/` changes once a quarter, every change gets the attention it deserves.
- Reorgs don't matter. The directory is a description of the system, not the team.
How the layers talk to each other
Lower layers expose outputs. Upper layers consume them as data sources.
```hcl
# network/outputs.tf
output "vpc_id" {
  value = aws_vpc.main.id
}

output "private_subnet_ids" {
  value = aws_subnet.private[*].id
}
```
```hcl
# platform/main.tf
data "terraform_remote_state" "network" {
  backend = "s3"
  config = {
    bucket = "acme-tf-state"
    key    = "network/terraform.tfstate"
    region = "eu-south-2"
  }
}

module "eks" {
  source     = "./modules/eks"
  vpc_id     = data.terraform_remote_state.network.outputs.vpc_id
  subnet_ids = data.terraform_remote_state.network.outputs.private_subnet_ids
}
```
The dependency arrow only points downward — apps depend on platform, platform depends on network, network depends on bootstrap. Never the other way. If you're tempted to read `apps/` state from `platform/`, you've put something in the wrong layer.
What about workspaces?
Use `terraform workspace` for environments (dev / staging / prod) within a single layer. Don't use it as a substitute for splitting layers — workspaces still share the same code, providers, and lock file. Splitting states and using workspaces are orthogonal tools that solve different problems.
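A minimal sketch of what that looks like inside one layer; the resource and naming scheme are hypothetical, the point is that all three environments share this exact code and differ only in state:

```hcl
# platform/main.tf
locals {
  env = terraform.workspace   # "dev", "staging" or "prod"
}

resource "aws_ecr_repository" "api" {
  name = "api-${local.env}"   # hypothetical per-env naming
}
```

With the S3 backend, each non-default workspace's state lands under an `env:/<workspace>/` prefix of the same key, so the layer split above is untouched.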
Tooling notes
- Terragrunt earns its keep here — DRY backend config across layers, dependency graphs that warn before you delete an output another state still consumes (a sketch follows this list).
- Atlantis or Spacelift can scope plan-on-PR to only the changed directory, which keeps CI fast.
- Don't use one S3 bucket per state. One bucket, multiple keys. Bucket-per-state turns into a permission nightmare.
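For the Terragrunt point above, a rough sketch under the same bucket and region as the earlier examples: the backend is declared once at the repo root, and each layer names its upstream dependency so Terragrunt can order runs and catch broken references.

```hcl
# terragrunt.hcl (repo root) -- backend config written once
remote_state {
  backend = "s3"
  generate = {
    path      = "backend.tf"
    if_exists = "overwrite"
  }
  config = {
    bucket = "acme-tf-state"
    key    = "${path_relative_to_include()}/terraform.tfstate"
    region = "eu-south-2"
  }
}
```

```hcl
# platform/terragrunt.hcl
include {
  path = find_in_parent_folders()
}

dependency "network" {
  config_path = "../network"
}

inputs = {
  vpc_id     = dependency.network.outputs.vpc_id
  subnet_ids = dependency.network.outputs.private_subnet_ids
}
```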
Stuck mid-split, or starting from a 4,000-resource monolith? We've untangled a few of these — happy to talk it through.