Field Notes · AI Tools · May 1, 2026

Evaluating Claude's Terraform skills: token cost, output quality, and when to use none of them

I tested four configurations (three MCP skills plus bare Claude) against five real infrastructure tasks. The token math is real, but it's not the most interesting finding.

What I tested and why

Most DevOps engineers using Claude for Terraform work never stop to think about which skill is active. They install one, get reasonable results, and move on. I got curious about what the choice actually costs in tokens, in output quality, and in the correction work that ends up happening downstream.

I ran five representative infrastructure tasks across four configurations via the Anthropic API, recording input tokens, output tokens, and dollar cost per session. The four configurations were three MCP skills (terraform-skill by Anton Babenko, TerraShark by Lukas Niessen, and HashiCorp's official style guide skill) plus bare Claude with no skill active. That fourth configuration was the most important one: it answers whether any skill actually earns its overhead.

Tasks covered a range of real-world work: a greenfield VPC module, an IAM role with least-privilege S3 access, an RDS PostgreSQL module, adding infracost support to an existing module, and refactoring flat inline code into a reusable module structure.

Three different philosophies, plus a baseline

What became clear during testing is that these aren't three versions of the same thing; they represent genuinely different approaches to guiding Claude's Terraform output. All three are available through the Claude Code skill marketplace; terraform-skill and TerraShark are also on GitHub at antonbabenko/terraform-skill and LukasNiessen/terrashark.

terraform-skill
Anton Babenko · v1.6.0
1,680 avg input tokens
Structure-first

The most opinionated of the three. Enforces module hierarchy, naming conventions, and variable ordering. Brings structural guardrails; you don't prompt for them.

TerraShark
Lukas Niessen
666 avg input tokens
Risk-first

Focused on failure modes: identity churn, secret exposure, blast radius, CI drift, compliance gaps. A lighter footprint with a different philosophical lens: review over generation.

hashicorp-style-guide
HashiCorp official
852 avg input tokens
Style-first

HashiCorp's official conventions for formatting, naming, and file organisation. No strong opinions on security or module architecture; a style enforcer, not an architect.

No skill
Bare Claude · baseline
60 avg input tokens
Baseline

No system prompt; the honest baseline for whether adding any skill is worth the overhead.

What each configuration costs per task

All sessions ran against Claude Sonnet 4.6 via the Anthropic API at $3.00 per million input tokens and $15.00 per million output tokens. These are averages across all five tasks.

terraform-skill          $0.048 avg per task
TerraShark               $0.057 avg per task
hashicorp-style-guide    $0.043 avg per task
No skill                 $0.046 avg per task

The numbers in the cards above look close together, and they are, but the reason why is worth understanding. Skills add tokens at the start of every session, before you've typed a single prompt. You might expect a skill with 1,680 activation tokens to cost significantly more than bare Claude at 60. In practice it doesn't, because what drives cost is how much Claude writes back, not how much you put in. Output tokens are five times more expensive than input tokens, so a detailed response quickly swamps any activation overhead. That's also why TerraShark ends up the most expensive despite having the lightest activation footprint: its failure-mode workflow generates longer responses, with prose analysis and risk commentary alongside the HCL.
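
To put rough numbers on that, assume a 3,000-token response — a plausible size for these tasks, not a measured average — at the API prices listed below:

activation overhead:  1,680 tokens × $3.00 per 1M tokens  = $0.0050
one response:         3,000 tokens × $15.00 per 1M tokens = $0.0450

The heaviest activation footprint costs about half a cent; a single detailed response costs roughly nine times that.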

The practical implication depends on how you use Claude. On a Pro subscription you're not paying per token, but you are spending from a session allowance that resets on a timer, so a configuration that uses more tokens per task means fewer tasks before you hit that limit. If you do heavy Terraform work (multiple modules, back-to-back sessions), the choice of skill affects how far your allowance stretches across the day. Run this in a CI pipeline or trigger it on every PR and the cost figures above become line items in your cloud spend; small per-task differences compound fast when you're doing hundreds of runs a day.

Task by task

Task                 terraform-skill    TerraShark      hashicorp-style-guide   No skill
1 · VPC greenfield   FAIL¹   $0.066     PASS   $0.063   PASS    $0.049          PASS   $0.062
2 · IAM role         PASS    $0.034     PASS   $0.063   FAIL²   $0.029          PASS   $0.049
3 · RDS greenfield   PASS    $0.066     PASS   $0.063   PASS    $0.064          PASS   $0.062
4 · infracost        PASS    $0.038     PASS   $0.041   PASS    $0.032          PASS   $0.032
5 · Refactor         PASS    $0.034     PASS   $0.055   PASS    $0.027          PASS   $0.028

PASS = terraform validate succeeded · FAIL = genuine HCL error

¹ terraform-skill, Task 1: ternary operator split incorrectly across lines in the route table tags block  ·  ² hashicorp-style-guide, Task 2: references to undeclared data.aws_ami and module.iam

What the data actually showed

Task 1 — prompt precision: All four configurations produced complete, production-quality VPC modules, though terraform-skill's needed a syntax fix before terraform validate would pass (the ternary error noted in the table footnotes). No-skill went beyond the brief and added VPC Flow Logs unprompted: a CloudWatch log group, an IAM role, and the flow log resource itself. The clearest reminder of the evaluation: a well-crafted prompt to bare Claude is often all you need.
Task 2 — IAM quality: The sharpest quality differences of the evaluation. TerraShark's output was the most security-hardened by a clear margin: it included an explicit deny block for write and delete actions, a confused-deputy guard on the trust policy, force_detach_policies on the role, and a check block to verify the role exists after apply. None of the other three configurations produced these controls unprompted. This is TerraShark doing exactly what it was designed for: failure-mode awareness surfacing as concrete security controls in generated code.
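
To make those controls concrete, here is a minimal sketch of the pattern — my reconstruction, not TerraShark's verbatim output; resource names and policy wording are illustrative (the check block needs Terraform 1.5+):

variable "bucket_name" {
  description = "Target S3 bucket the role may read from."
  type        = string
}

data "aws_caller_identity" "current" {}

resource "aws_iam_role" "reader" {
  name                  = "s3-reader"   # illustrative name
  force_detach_policies = true
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Action    = "sts:AssumeRole"
      Principal = { Service = "ec2.amazonaws.com" }
      # Confused-deputy guard: only principals in this account may assume the role
      Condition = {
        StringEquals = { "aws:SourceAccount" = data.aws_caller_identity.current.account_id }
      }
    }]
  })
}

resource "aws_iam_role_policy" "read_only" {
  name = "s3-read-only"
  role = aws_iam_role.reader.id
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect   = "Allow"
        Action   = ["s3:GetObject", "s3:ListBucket"]
        Resource = ["arn:aws:s3:::${var.bucket_name}", "arn:aws:s3:::${var.bucket_name}/*"]
      },
      {
        # Explicit deny: writes and deletes stay blocked even if a broader
        # policy is attached to the role later
        Effect   = "Deny"
        Action   = ["s3:PutObject", "s3:DeleteObject", "s3:DeleteBucket"]
        Resource = ["arn:aws:s3:::${var.bucket_name}", "arn:aws:s3:::${var.bucket_name}/*"]
      }
    ]
  })
}

# Check block: assert the role is actually resolvable after apply
check "role_exists" {
  data "aws_iam_role" "verify" {
    name = aws_iam_role.reader.name
  }
  assert {
    condition     = data.aws_iam_role.verify.name == aws_iam_role.reader.name
    error_message = "IAM role was not found after apply."
  }
}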
Task 3 — RDS greenfield: All four configurations passed and produced complete, working RDS modules. TerraShark stood out on variable validation: it enforced RFC-1918 CIDR ranges, validated AWS region format, and required explicit availability zone specification to prevent identity churn if default AZ ordering ever changes. Small details that reflect its failure-mode design show up even on a straightforward generation task.
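
A rough sketch of what those validations look like — paraphrased, not verbatim from TerraShark's output (startswith needs Terraform 1.3+):

variable "allowed_cidr" {
  description = "CIDR allowed to reach the database; must be a private range."
  type        = string
  validation {
    # Valid CIDR syntax and one of the RFC-1918 private blocks
    condition = can(cidrnetmask(var.allowed_cidr)) && anytrue([
      startswith(var.allowed_cidr, "10."),
      can(regex("^172\\.(1[6-9]|2[0-9]|3[01])\\.", var.allowed_cidr)),
      startswith(var.allowed_cidr, "192.168."),
    ])
    error_message = "allowed_cidr must be a valid RFC-1918 CIDR block."
  }
}

variable "region" {
  description = "AWS region."
  type        = string
  validation {
    condition     = can(regex("^[a-z]{2}(-[a-z]+)+-[0-9]$", var.region))
    error_message = "region must look like us-east-1."
  }
}

variable "availability_zones" {
  description = "Explicit AZ list; prevents identity churn if default ordering changes."
  type        = list(string)
  validation {
    condition     = length(var.availability_zones) >= 1
    error_message = "Specify at least one availability zone explicitly."
  }
}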
Task 4 — infracost: All four configurations correctly understood that infracost requires no new Terraform resources. Each responded with cost-aware tagging, infracost-specific comments, and usage variable patterns: the right approach for making a module more infracost-friendly. No-skill explicitly noted that CLI invocations can't be expressed in HCL. The cleanest result of the evaluation: consistent, accurate output across all four configurations.
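
For flavour, the kind of change this amounts to in practice — a sketch of the general pattern, not any one configuration's exact output; the tag keys and the environment variable are illustrative:

variable "environment" {
  description = "Environment name for cost allocation."
  type        = string
}

# No new resources: cost-allocation tags make per-environment estimates meaningful
resource "aws_vpc" "main" {
  cidr_block = "10.0.0.0/16"
  tags = {
    Name        = "main-vpc"
    Environment = var.environment
    CostCenter  = "networking"
  }
}

# Usage-driven costs (e.g. NAT gateway data processing) belong in an
# infracost usage file (infracost-usage.yml), not in HCL.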
Task 5 — refactor (wildcard IAM): The starting module contained a deliberate security flaw, s3:* on Resource = "*". Three configurations carried it forward silently: no-skill, terraform-skill, and hashicorp-style-guide all refactored the code without flagging the overly permissive policy. TerraShark was the only configuration that caught it; it surfaced the wildcard as an explicit variable with a BLAST RADIUS WARNING comment and instructed the caller to override it for production. It didn't silently fix it, which is arguably more honest: the caller needs to make that decision consciously.
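
Roughly the shape TerraShark's refactor took — paraphrased from its described output, with illustrative variable names:

variable "iam_policy_actions" {
  description = "IAM actions granted to the app role. Override for production."
  type        = list(string)
  # BLAST RADIUS WARNING: the default preserves the original s3:* wildcard so
  # the refactor does not change behaviour. Narrow this before production use.
  default     = ["s3:*"]
}

variable "iam_policy_resources" {
  description = "Resources those actions apply to. Override for production."
  type        = list(string)
  # BLAST RADIUS WARNING: "*" preserves the original policy's scope.
  default     = ["*"]
}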
The baseline: Bare Claude (no skill, 60 input tokens, no overhead) produced clean, production-quality output across all five tasks. It chose multi-module structures unprompted on the more complex tasks, which is the right architectural call for production code. Where it fell short was security depth: TerraShark produced more hardened IAM output on Task 2 and more rigorous variable validation on Task 3. For infrastructure generation, running Claude with no skill active is not a fallback; it's a legitimate default, with the understanding that it won't apply failure-mode thinking unless you ask for it.
On correction work: Two configurations produced output that required correction before it could be used: terraform-skill on Task 1 had a genuine HCL syntax error in the route table tags block, and hashicorp-style-guide on Task 2 referenced undeclared resources. TerraShark on Task 5 is a different category: it didn't produce broken code, it produced a conscious decision point around the wildcard IAM policy that required the caller to act. No-skill required no correction on any task where its output was complete. If minimising downstream rework is the primary concern, the baseline is the safest choice.
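
For reference, the terraform-skill failure was the classic HCL line-continuation trap. A reconstruction of the shape of the error, not the literal output:

# Invalid: HCL rejects a conditional split across lines without parentheses
tags = {
  Name = var.environment == "prod"
    ? "rt-public-prod"
    : "rt-public-${var.environment}"
}

# Valid: parenthesise the whole expression (or keep it on one line)
tags = {
  Name = (
    var.environment == "prod"
    ? "rt-public-prod"
    : "rt-public-${var.environment}"
  )
}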

What I'd actually do

Decision framework

Default: No skill with a clear prompt for most generation tasks. Output quality was competitive across the board and it made good architectural decisions unprompted. You'll get further faster if your prompt is precise, but that's true of any configuration.
Greenfield modules: terraform-skill when you want opinionated structure (naming conventions, variable ordering, module hierarchy) enforced automatically; see the sketch after this list. Worth the 1,680-token overhead when the alternative is prompting for those conventions yourself every session.
Code review: TerraShark when reviewing or refactoring existing infrastructure. Its failure-mode framing surfaced a deliberately embedded wildcard IAM flaw that the three other configurations missed entirely. It's a liability on greenfield generation tasks where you need complete output, but on review and refactor tasks it does exactly what it was designed to do.
Style enforcement: hashicorp-style-guide when formatting consistency and HashiCorp conventions are the primary concern. It produced the most complete greenfield VPC module on Task 1 and passed cleanly on RDS. A reliable generation skill; just don't expect it to think about security.
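
To make "opinionated structure" concrete, a sketch in the spirit of terraform-skill's conventions — my paraphrase, not the skill's literal template:

# variables.tf: required inputs first, then optional inputs with defaults;
# every variable carries a description and type (ordering is illustrative)
variable "vpc_cidr" {
  description = "CIDR block for the VPC."
  type        = string
}

variable "environment" {
  description = "Environment name, used in resource names and tags."
  type        = string
}

variable "enable_nat_gateway" {
  description = "Whether to create a NAT gateway for private subnets."
  type        = bool
  default     = true
}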

A framework, not a ranking

I started this because I was curious about cost. Which skill was the cheapest to run? Was the token overhead worth it? Those felt like the right questions going in.

They turned out to have the least interesting answers. The cost differences between configurations are small enough that they only matter at scale or in automation. What I didn't expect was how clearly the evaluation would reveal the character of each skill: not just whether it produced valid HCL, but how it thinks about infrastructure problems.

TerraShark caught a deliberate security flaw that three other configurations missed entirely. terraform-skill consistently generated well-structured, opinionated modules. hashicorp-style-guide produced the cleanest greenfield VPC on the first attempt. Bare Claude delivered production-quality output at the lowest cost across every task.

None of those is a reason to always use one and never use the others. They're tools with different strengths, and the engineers who'll get the most out of them are the ones who pick deliberately rather than defaulting to whatever was installed first.

If I had to reduce it to one sentence: use no-skill as your default, reach for terraform-skill when structure matters, and load TerraShark when you're working with code you didn't write.

What I asked

Same prompt, four configurations, fresh session each time. Tasks 4 and 5 included a starting module pasted at the end of the prompt.


Task 1 — VPC greenfield

Create a Terraform module for a VPC on AWS with public and private subnets across two availability zones, a NAT gateway, and appropriate route tables. Use variables for the VPC CIDR, environment name, and region. Follow current best practices.

Task 2 — IAM role

Create a Terraform resource for an IAM role that allows an EC2 instance to read from a specific S3 bucket. The bucket name should be a variable. Apply least-privilege principles — no wildcards on actions or resources.

Task 3 — RDS greenfield

Create a Terraform module for an RDS PostgreSQL instance in a private subnet. Include a subnet group, parameter group, and security group that allows access only from a specified CIDR. Use variables for instance class, storage, database name, and environment.

Task 4 — infracost augmentation

Update this Terraform module to support infracost cost estimation. Add any required Terraform configuration changes only — do not generate CI/CD pipelines, GitHub Actions workflows, or shell scripts. Here is the existing module:

Starting module

provider "aws" {
  region = "us-east-1"
}

resource "aws_vpc" "main" {
  cidr_block = "10.0.0.0/16"
  tags = { Name = "main-vpc" }
}

resource "aws_subnet" "public" {
  vpc_id            = aws_vpc.main.id
  cidr_block        = "10.0.1.0/24"
  availability_zone = "us-east-1a"
  tags = { Name = "public-subnet" }
}

resource "aws_internet_gateway" "main" {
  vpc_id = aws_vpc.main.id
  tags = { Name = "main-igw" }
}

resource "aws_route_table" "public" {
  vpc_id = aws_vpc.main.id
  route {
    cidr_block = "0.0.0.0/0"
    gateway_id = aws_internet_gateway.main.id
  }
}

resource "aws_route_table_association" "public" {
  subnet_id      = aws_subnet.public.id
  route_table_id = aws_route_table.public.id
}

Task 5 — refactor

Refactor this Terraform code into a reusable module with a clean variable interface. Do not change what the infrastructure does — only the structure. Here is the existing code:

Starting module — note the deliberate s3:* wildcard in the aws_iam_role_policy

provider "aws" {
  region = "us-west-2"
}

resource "aws_s3_bucket" "data" {
  bucket = "my-company-data-bucket"
}

resource "aws_s3_bucket_versioning" "data" {
  bucket = aws_s3_bucket.data.id
  versioning_configuration { status = "Enabled" }
}

resource "aws_iam_role" "app" {
  name = "app-role"
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Action    = "sts:AssumeRole"
      Effect    = "Allow"
      Principal = { Service = "ec2.amazonaws.com" }
    }]
  })
}

resource "aws_iam_role_policy" "app" {
  name = "app-policy"
  role = aws_iam_role.app.id
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect   = "Allow"
      Action   = "s3:*"
      Resource = "*"
    }]
  })
}

resource "aws_instance" "app" {
  ami                  = "ami-0c55b159cbfafe1f0"
  instance_type        = "t3.medium"
  iam_instance_profile = aws_iam_instance_profile.app.name
  tags = { Name = "app-server" }
}

resource "aws_iam_instance_profile" "app" {
  name = "app-profile"
  role = aws_iam_role.app.id
}

What this evaluation can and can't tell you

I ran this via the Anthropic API against Claude Sonnet 4.6 in May 2026. The cost figures and token counts are accurate for that model and pricing; both will drift as Anthropic updates pricing and as skill authors update their content.

What should hold longer than any specific number is the character of each skill. A structure-first tool, a failure-mode tool, and a style tool have different instincts, and those instincts don't change with a version bump. The wildcard IAM catch and the IAM security-depth results reflect those instincts directly, and they are the findings worth carrying forward.

Questions about this evaluation or how I approached it? Get in touch.